I asked Michael Hunger some questions about CSV imports, and per his 
advice I'm moving the discussion thread to the group. Here is the 
discussion below.

Hi Michael, my name's Isaac Vargas. I attended GraphConnect this past 
October and have been diving headfirst into Neo4j. I've been using 
your batch importer to migrate a large SQL database (~200 million rows). 
I've been successful up to a certain point, and was wondering if you had 
any quick pointers.

The problem: I am exporting with SQLCMD to a UTF-8 CSV file in order to 
then import with your batch importer. The bulk of the data consists of 
social media posts, which include a lot of Unicode/non-ASCII 
characters. I've noticed that when I attempt a test import of about 1 
million post records, I lose over half of them in the import, 
consistently getting back 499035 records as opposed to my test pull 
of TOP 1000000. I have tried a slew of different flags and encoding 
parameters, but nothing seems to work.

Is there any trick to getting Unicode characters into Neo via the batch 
importer?
Michael Hunger <[email protected]>
Nov 10 (2 days ago)
to me
Hi Isaac,

there might be three main issues:

- incorrectly quoted text, e.g. Michael,39,This is my blog "^) I want to
have fun <- the stray quote leads the parser to scan everything up to the
next double quote into a single value
- incorrectly escaped quotes, e.g. Michael,39,"This is my blog "^) I
want to have fun" <- the quote in the middle should be doubled
- delimiters in the text without the text being quoted, e.g.
Michael,39,This is my blog, I want to have fun <- comma in the text
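
To illustrate the second case, here is a minimal sketch (file and column
names are made up) of the correctly escaped form: the field is wrapped in
quotes, the embedded quote is doubled, and the embedded comma becomes safe:

```shell
# Correctly escaped row: notes field quoted, inner quote doubled
printf '%s\n' 'name,age,notes' \
  'Michael,39,"This is my blog ""^) I want, to have fun"' > quoted.csv

# Check that the row parses into exactly 3 fields
python3 -c 'import csv; row = list(csv.reader(open("quoted.csv")))[1]; print(len(row))'
# prints: 3
```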


It's best to test a sample (e.g. 200k rows) with csvstat from csvkit
(https://csvkit.readthedocs.org/en/0.9.0/)

and look at the field sizes to see whether any fields are suddenly huge.
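
For example (file names are placeholders): sample the export with
`head -n 200001 posts.csv > posts_sample.csv`, then run `csvstat posts_sample.csv`
if csvkit is installed. A rough awk fallback that reports the longest raw value
per column (naively splitting on commas, so quoted commas will skew it) might be:

```shell
# Toy file standing in for the sampled export
printf '%s\n' 'id,age,notes' '1,39,hello' '2,41,a much longer note here' > posts_sample.csv

# Longest raw value seen in each column; a column whose max length is
# suddenly huge usually marks the first broken row
awk -F',' '{ for (i = 1; i <= NF; i++) if (length($i) > max[i]) max[i] = length($i) }
           END { for (i = 1; i <= NF; i++) print "col", i, "max", max[i] }' posts_sample.csv
```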

Perhaps you can share a sample of your data.

Otherwise it helps to use csvgrep or grep to find potential issues in the
csv.
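
One quick smell test along those lines (a sketch; the file name is made up):
a line with an odd number of double quotes almost always has a stray or
unescaped quote. Setting awk's field separator to " makes such lines show an
even field count:

```shell
# Build a small file with one deliberately broken row
printf '%s\n' 'id,notes' '1,"fine note"' '2,"broken " note"' '3,plain' > bad.csv

# With FS set to ", a line has NF = quotes + 1, so an even NF means an
# odd (unbalanced) number of quotes -- print those suspect lines
awk -F'"' 'NF % 2 == 0 { print NR ": " $0 }' bad.csv
# prints: 3: 2,"broken " note"
```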

Another thing we ran into was files with binary zeros in them; you'd have
to remove those with tr -d
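
A minimal sketch of that cleanup (file names are made up):

```shell
# Fake a damaged export containing a NUL byte (binary zero)
printf 'a,b\0c\n' > raw.csv

# Strip all NUL bytes; they silently break many CSV readers
tr -d '\000' < raw.csv > clean.csv
```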


see also: jexp.de/blog/2014/10/load-cvs-with-success/


On Mon, 10 Nov 2014 16:01:23 -0600, Isaac Vargas <[email protected]>
wrote:
Isaac Vargas <[email protected]>
7:47 PM (14 hours ago)
to Michael
After using csvgrep/csvstat and awk, I was able to determine a couple 
of things.

In my test set of 200k records, awk in a Git Bash shell on Windows 8 
confirms that there are indeed 200k rows plus 1 row of headers, and that 
the data is formatted correctly through all 200k rows. However, when I 
move that file into my Ubuntu environment and run csvgrep to verify the 
same data, I get erratic results. In particular, for whatever reason, a 
number of rows are being concatenated into a single post_notes field 
rather than breaking into new lines in the Linux environment, which 
causes it to skip several thousand rows at a time. I am unable to 
ascertain why this is the case. Replacing all of the quotes on the 
server with flat text does not fix the issue.
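
For reference, the kind of awk check described above can be sketched like
this (file name made up); it counts lines and flags any whose comma-split
field count differs from the header's:

```shell
# Toy stand-in for the 200k-row export
printf '%s\n' 'id,notes' '1,hello' '2,world' > export.csv

# Count lines and lines whose raw field count differs from the header
awk -F',' 'NR == 1 { expect = NF }
           NF != expect { bad++ }
           END { print NR " lines, " bad + 0 " with unexpected field count" }' export.csv
# prints: 3 lines, 0 with unexpected field count
```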

I am moving the file into Linux because import.bat fails on my large 
dataset.
Michael Hunger
1:21 AM (8 hours ago)
to me
Did you try:

- converting Windows linefeeds into Unix ones?
- checking for binary zeros in your file (see the blog post)?
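
For the first point, a sketch (file names made up) of stripping Windows
CRLF line endings with tr; note that tr -d '\r' removes every carriage
return, including any inside field values:

```shell
# Fake a CRLF-terminated export
printf 'a,b\r\nc,d\r\n' > windows.csv

# Drop the carriage returns to get Unix LF endings (like dos2unix)
tr -d '\r' < windows.csv > unix.csv
```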

It would be good to continue this conversation on the public Neo4j Google
group, as I can't do one-on-one support without everyone benefiting :)

Cheers, Michael

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
