Did you try running it with -Dfile.encoding=UTF-8? Perhaps an importer written in Java/Groovy would also be more stable for reading the data.
See here: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/

On Wed, Nov 12, 2014 at 5:02 PM, Isaac Vargas <[email protected]> wrote:
> I had asked some questions of Michael Hunger about CSV imports, and per
> his advice I am moving the discussion thread to the Group. Here is the
> discussion below.
>
> Hi Michael, my name's Isaac Vargas, I attended GraphConnect this past
> October and have been diving head first into Neo4j. I've been making use of
> your batch importer to migrate a large SQL database (~200 million rows) and
> I've been successful up to a certain point, and was wondering if you had
> any quick pointers.
>
> The problem is I am exporting using SQLCMD to a UTF-8 CSV file in order to
> then import using your batch importer. The bulk of the data consists of
> social media post data, which includes a lot of Unicode/non-ASCII
> characters, and I've noticed that when I attempt a test import of about 1
> million post records, I am losing over half of them in the import, and
> consistently getting a return of 499035 records as opposed to my test pull
> of TOP 1000000. I have tried a slew of different flags and encoding params,
> but nothing seems to be working.
>
> Is there any trick to getting Unicode characters into Neo via the batch
> importer?
>
> Michael Hunger <[email protected]>
> Nov 10 (2 days ago)
> to me
>
> Hi Isaac,
>
> there might be three main issues:
>
> - incorrectly quoted text, e.g. Michael,39,This is my blog "^) I want to
>   have fun <- the quote will lead the parser to scan up to the next double
>   quote into a single value
> - incorrectly escaped quotes, e.g. Michael,39,"This is my blog "^) I
>   want to have fun" <- the lone double quote in the middle should be doubled
> - delimiters in the text, without the text being quoted, e.g.
>   Michael,39,This is my blog, I want to have fun <- comma in the text
>
> Best is to test a sample (e.g. 200k rows) with csvstat of
> https://csvkit.readthedocs.org/en/0.9.0/
>
> And look at the field sizes to see if there are fields that are suddenly huge.
>
> Perhaps you can share a sample of your data.
>
> Otherwise it helps to use csvgrep or grep to find potential issues in the
> csv.
>
> Another thing we ran into was files with binary zeros in them; you'd have
> to remove them with tr -d
>
> see also: jexp.de/blog/2014/10/load-cvs-with-success/
>
> On Mon, 10 Nov 2014 16:01:23 -0600, Isaac Vargas <[email protected]>
> wrote:
>
> Isaac Vargas <[email protected]>
> 7:47 PM (14 hours ago)
> to Michael
>
> After making use of csvgrep/csvstat and awk, I was able to determine a
> couple of things.
>
> In my test set of 200k records, I am able to determine with awk in a Git
> Bash shell on Windows 8 that there are indeed 200K rows with 1 row of
> headers, and that the data is formatted correctly through the 200K rows.
> However, when I move that file into my Ubuntu environment and run csvgrep
> to verify the same data, I am getting erratic results. In particular, for
> whatever reason, a number of rows are being concatenated into a single
> post_notes field instead of breaking into a new line in the Linux
> environment, which causes it to skip several thousand rows at a time. I am
> unable to ascertain why this is the case. Replacing all of the quotes in
> the server with flat text does not fix the issue.
>
> I am moving the file into Linux because the import.bat fails me on my
> large dataset.
>
> Michael Hunger
> 1:21 AM (8 hours ago)
> to me
>
> Did you try:
>
> - to convert windows linefeeds into unix ones?
> - to check for binary zeros in your file (see the blog post)
>
> It would be good to continue this conversation on the public neo4j google
> group as I can't do one to one support without everyone benefitting :)
>
> Cheers, Michael
>
> On Tue, 11 Nov 2014 19:47:35 -0600, Isaac Vargas <[email protected]>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
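
The checks suggested in this thread (Windows linefeeds, binary zeros, fields that are suddenly huge, rows swallowed by an unclosed quote) can be roughly sketched in Python as below. This is only an illustration under stated assumptions, not part of the batch importer: the function names, the MAX_FIELD_LEN threshold, and the assumption that the export really is UTF-8 are all mine. Python's csv parser is also more tolerant than some importers (for example, it accepts a stray double quote in the middle of an unquoted field), so a clean result here does not guarantee a clean import.

```python
import csv
import io

# Assumed threshold for "suddenly huge" fields; tune it for your data.
MAX_FIELD_LEN = 10_000


def clean_bytes(raw: bytes) -> bytes:
    """Drop binary zeros and convert Windows (CRLF) line endings to Unix (LF)."""
    raw = raw.replace(b"\x00", b"")
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def inspect_csv(text: str, max_field_len: int = MAX_FIELD_LEN):
    """Parse CSV text and return (data_record_count, problems).

    Each problem is a (record_number, description) tuple, where record 1
    is the first record after the header row.
    """
    # Raise the csv module's per-field limit so a runaway quoted field is
    # reported below instead of aborting the parse.
    csv.field_size_limit(2**31 - 1)
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return 0, []
    problems = []
    count = 0
    for count, row in enumerate(reader, start=1):
        if len(row) != len(header):
            problems.append(
                (count, f"expected {len(header)} fields, got {len(row)}")
            )
        for field in row:
            if len(field) > max_field_len:
                problems.append((count, f"field of {len(field)} characters"))
            if "\n" in field:
                problems.append(
                    (count, "embedded newline; possibly an unclosed quote")
                )
    return count, problems


def check_file(path: str):
    """Clean a CSV file in memory (assumed UTF-8) and inspect it."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # "utf-8-sig" also drops a leading byte-order mark if one is present.
    text = clean_bytes(raw).decode("utf-8-sig")
    return inspect_csv(text)
```

Calling check_file on a sample export (the path is whatever your 200k-row test file is called) returns the number of parsed data records and a list of suspicious ones. If the count is lower than the number of rows pulled from SQL, the records flagged with an embedded newline are a reasonable place to start looking. The sketch reads the whole file into memory, which is fine for a sample but not for the full 200 million rows.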
