Did you try running it with -Dfile.encoding=UTF-8?

Perhaps an importer written in Java/Groovy would be more robust for reading the data?

See here:
http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/

On Wed, Nov 12, 2014 at 5:02 PM, Isaac Vargas <[email protected]> wrote:

> I had asked Michael Hunger some questions about CSV imports and, per
> his advice, am moving the discussion thread to the Group. The discussion
> so far is below.
>
> Hi Michael, my name's Isaac Vargas, I attended GraphConnect this past
> October and have been diving head first into Neo4j. I've been making use of
> your batch importer to migrate a large SQL database (~200 million rows) and
> I've been successful up to a certain point, and was wondering if you had
> any quick pointers.
>
> The problem is I am exporting using SQLCMD to a UTF-8 CSV file in order to
> then import using your batch importer. The bulk of the data consists of
> social media post data, which includes a lot of Unicode/non ASCII
> characters, and I've noticed that when I attempt a test import of about 1
> million post records, I am losing over half of them in the import, and
> consistently getting a return of 499035 records as opposed to my test pull
> of TOP 1000000. I have tried a slew of different flags and encoding params,
> but nothing seems to be working.
>
> Is there any trick to getting Unicode characters into Neo via the batch
> importer?
> Michael Hunger <[email protected]>
> Nov 10 (2 days ago)
> to me
> Hi Isaac,
>
> there might be three main issues:
>
> - incorrectly quoted text, e.g. Michael,39,This is my blog "^) I want to
> have fun <- the quote will lead the parser to scan up to the next double
> quote and read everything in between as a single value
> - incorrectly escaped quotes, e.g. Michael,39,"This is my blog "^) I
> want to have fun" <- the double quote in the middle should be doubled
> - delimiters in the text, without the text being quoted, e.g.
> Michael,39,This is my blog, I want to have fun <- comma in the text
>
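The three cases above can be screened for with a rough line-by-line check. This is only a sketch: find_suspect_lines and the name/age/note header are made up for illustration, it uses Python's csv module rather than the batch importer's own parser, and it will also flag legitimate quoted fields that span several lines.

```python
import csv


def find_suspect_lines(text, delimiter=","):
    """Return (line_number, reason) pairs for lines that look malformed."""
    lines = text.splitlines()
    # The header line defines how many fields every record should have.
    expected = len(next(csv.reader([lines[0]], delimiter=delimiter)))
    suspects = []
    for number, line in enumerate(lines[1:], start=2):
        # An odd number of double quotes means a quote is unbalanced
        # (or the field continues on the next physical line).
        if line.count('"') % 2 == 1:
            suspects.append((number, "unbalanced double quote"))
            continue
        fields = next(csv.reader([line], delimiter=delimiter))
        if len(fields) != expected:
            reason = "expected %d fields, got %d" % (expected, len(fields))
            suspects.append((number, reason))
    return suspects


# Invented sample data based on the examples in the message above.
sample = "\n".join([
    "name,age,note",
    'Michael,39,"This is my blog, I want to have fun"',      # fine: comma is quoted
    'Michael,39,This is my blog "^) I want to have fun',      # stray quote
    'Michael,39,"This is my blog "^) I want to have fun"',    # inner quote not doubled
    'Michael,39,This is my blog, I want to have fun',         # unquoted delimiter
    'Michael,39,"This is my blog ""^) I want to have fun"',   # fine: quote is doubled
])

for number, reason in find_suspect_lines(sample):
    print(number, reason)
```

Lines 3, 4 and 5 of the sample are reported; the two correctly quoted lines pass.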
>
> It is best to test a sample (e.g. 200k rows) with csvstat from csvkit:
> https://csvkit.readthedocs.org/en/0.9.0/
>
> Then look at the field sizes to see whether any fields are suddenly huge.
>
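If csvkit is not at hand, the same "suddenly huge field" check can be approximated in a few lines. This is a sketch under assumptions: longest_fields is a hypothetical helper, the sample data is invented, and Python's csv module is lenient about an unclosed quote at end of input, which is what lets the swallowed rows show up as one oversized field.

```python
import csv
import io


def longest_fields(lines, top=3):
    """Return the `top` longest fields as (length, line_number, column)."""
    reader = csv.reader(lines)
    header = next(reader)
    sizes = []
    for row in reader:
        for name, value in zip(header, row):
            # reader.line_num is the last physical line read for this record.
            sizes.append((len(value), reader.line_num, name))
    return sorted(sizes, reverse=True)[:top]


# Invented sample: the quote opened in record 2 is never closed,
# so the following lines are absorbed into its "note" field.
sample = 'id,note\n1,short\n2,"never closed\n3,swallowed\n4,also swallowed\n'

for length, line_number, column in longest_fields(io.StringIO(sample, newline="")):
    print(length, line_number, column)
```

A single field that is far longer than the rest, ending many lines after the previous record, is a strong hint that rows are being concatenated rather than lost.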
> Perhaps you can share a sample of your data.
>
> Otherwise it helps to use csvgrep or grep to find potential issues in the
> csv.
>
> Another thing we ran into was files with binary zeros in them; you'd have
> to remove them with tr -d
>
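For environments without tr, a rough equivalent of the binary-zero cleanup might look like the following. strip_nul_bytes and the in-memory sample are made up for illustration. One caveat: if the zeros turn out to come from a UTF-16 export rather than stray bytes, re-encoding the file to UTF-8 is probably the better fix than deleting them.

```python
import io


def strip_nul_bytes(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst without NUL bytes; return the number dropped."""
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.replace(b"\x00", b"")
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed


# In-memory demonstration; for real files, open both in binary mode ("rb"/"wb").
source = io.BytesIO(b"id,note\n1,he\x00llo\n2,wor\x00ld\n")
target = io.BytesIO()
print(strip_nul_bytes(source, target), target.getvalue())
```

Reading in chunks keeps memory use flat, which matters for an export of this size.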
>
> see also: jexp.de/blog/2014/10/load-cvs-with-success/
>
>
> On Mon, 10 Nov 2014 16:01:23 -0600, Isaac Vargas <[email protected]>
> wrote:
> Isaac Vargas <[email protected]>
> 7:47 PM (14 hours ago)
> to Michael
> After making use of csvgrep/csvstat and awk, I was able to determine a
> couple things.
>
> In my test set of 200k records, I am able to determine with awk in a Git
> Bash shell on Windows 8 that there are indeed 200K rows with 1 row of
> headers, and that the data is formatted correctly through the 200K rows.
> However, when I move that file into my Ubuntu environment and run csvgrep
> to verify the same data, I am getting erratic results. In particular, for
> whatever reason, what is happening is that a number of rows are being
> concatenated into a single post_notes field, and not breaking into a new
> line in the Linux environment, which is causing it to skip several thousand
> rows at a time. I am unable to ascertain why this is the case. Replacing
> all of the quotes in the server with flat text does not fix the issue.
>
> I am moving the file into Linux because import.bat fails on my
> large dataset.
> Michael Hunger
> 1:21 AM (8 hours ago)
> to me
> Did you try:
>
> - to convert windows linefeeds into unix ones?
> - to check for binary zeros in your file (see the blog post)
>
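Before converting anything, it can help to see which line endings the file actually contains. The sketch below uses hypothetical helper names and invented sample bytes. It counts Windows (CRLF), Unix (LF) and bare CR endings, and converts only CRLF to LF, since a bare CR may be part of the post text itself.

```python
def count_line_endings(data):
    """Count CRLF, bare LF and bare CR occurrences in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,   # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,   # CR not followed by LF
    }


def to_unix_newlines(data):
    """Convert Windows line endings to Unix ones, leaving bare CRs untouched."""
    return data.replace(b"\r\n", b"\n")


# Invented sample: two CRLF-terminated lines, then a record with a bare CR inside.
sample = b"id,note\r\n1,hello\r\n2,line one\rline two\n"

print(count_line_endings(sample))
print(to_unix_newlines(sample))
```

Mixed counts (some CRLF, some bare LF or CR) would be consistent with the file parsing differently on Windows and Linux.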
> It would be good to continue this conversation on the public neo4j google
> group as I can't do one to one support without everyone benefitting :)
>
> Cheers, Michael
>
> On Tue, 11 Nov 2014 19:47:35 -0600, Isaac Vargas <[email protected]>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
