I had asked Michael Hunger some questions about CSV imports, and per his advice I am moving the discussion thread to the Group. Here is the discussion:
Hi Michael, my name's Isaac Vargas. I attended GraphConnect this past October and have been diving head first into Neo4j. I've been using your batch importer to migrate a large SQL database (~200 million rows), and I've been successful up to a certain point, but I was wondering if you had any quick pointers. The problem: I am exporting with SQLCMD to a UTF-8 CSV file in order to then import with your batch importer. The bulk of the data is social media post data, which includes a lot of Unicode/non-ASCII characters, and I've noticed that when I attempt a test import of about 1 million post records, I lose over half of them in the import, consistently getting back 499035 records instead of my test pull of TOP 1000000. I have tried a slew of different flags and encoding params, but nothing seems to work. Is there any trick to getting Unicode characters into Neo via the batch importer?

Michael Hunger <[email protected]> Nov 10 (2 days ago)
to me

Hi Isaac,

There might be a few main issues:

- Not correctly quoted text, e.g.
  Michael,39,This is my blog "^) I want to have fun
  <- the quote will lead the parser to scan up to the next double quote, pulling everything in between into a single value
- Not correctly escaped quotes, e.g.
  Michael,39,"This is my blog "^) I want to have fun"
  <- the single quote in the middle should be doubled
- Delimiters in the text without the text being quoted, e.g.
  Michael,39,This is my blog, I want to have fun
  <- comma in the text

Best is to test a sample (e.g. 200k rows) with csvstat from https://csvkit.readthedocs.org/en/0.9.0/ and look at the field sizes to see whether any fields are suddenly huge. Perhaps you can share a sample of your data. Otherwise it helps to use csvgrep or grep to find potential issues in the CSV.

Another thing we ran into was files with binary zeros in them; you'd have to remove those with tr -d.

See also: jexp.de/blog/2014/10/load-cvs-with-success/

Isaac Vargas <[email protected]> 7:47 PM (14 hours ago)
to Michael

After making use of csvgrep/csvstat and awk, I was able to determine a couple of things. In my test set of 200k records, awk in a Git Bash shell on Windows 8 confirms that there are indeed 200K rows plus 1 header row, and that the data is formatted correctly through all 200K rows. However, when I move that file into my Ubuntu environment and run csvgrep to verify the same data, I get erratic results. In particular, a number of rows are being concatenated into a single post_notes field instead of breaking onto new lines in the Linux environment, which causes it to skip several thousand rows at a time. I am unable to ascertain why this is the case; replacing all of the quotes on the server with flat text does not fix the issue. I am moving the file into Linux because import.bat fails on my large dataset.

Michael Hunger 1:21 AM (8 hours ago)
to me

Did you try:

- converting Windows linefeeds into Unix ones?
- checking for binary zeros in your file (see the blog post)?

It would be good to continue this conversation on the public Neo4j Google group, as I can't do one-to-one support without everyone benefitting :)

Cheers, Michael
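For reference, here is how two of the problem rows Michael describes would have to look for a standard CSV parser (assuming a comma delimiter and RFC 4180-style double-quote escaping):

    Michael,39,"This is my blog ""^) I want to have fun"
    Michael,39,"This is my blog, I want to have fun"

Any field containing a delimiter, a quote, or a line break must be wrapped in double quotes, and any double quote inside such a field must itself be doubled.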
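A sketch of the diagnostics discussed above (csvstat, csvcut, awk), assuming the export lives in a file called posts.csv; the file name and the 1000-character threshold are placeholders, and post_notes is the column Isaac mentions:

    # profile the first 200k data rows (plus header); look for columns whose
    # longest value is suddenly huge -- a sign of rows being merged together
    head -n 200001 posts.csv > sample.csv
    csvstat sample.csv

    # count physical lines; should be 200001 (200K rows + 1 header row)
    awk 'END {print NR}' sample.csv

    # flag suspiciously long post_notes values (threshold is arbitrary); this
    # counts output lines, which only approximates CSV rows when fields
    # contain embedded newlines
    csvcut -c post_notes sample.csv | awk 'length($0) > 1000 {print NR ": " length($0)}'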
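And a sketch of Michael's two cleanup suggestions as shell commands (file names are placeholders):

    # strip binary zeros (NUL bytes) from the export
    tr -d '\000' < posts.csv > posts_nonul.csv

    # convert Windows CRLF line endings to Unix LF
    tr -d '\r' < posts_nonul.csv > posts_unix.csv

Note that tr -d '\r' removes every carriage return, including any embedded inside quoted fields; dos2unix, which only rewrites CR+LF pairs, is a gentler alternative if it is available.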
