Re: Question about loading IMDb dataset from CSV files

Mike Carey Tue, 11 Jun 2024 16:58:41 -0700

So suppose we make the long-term rule that we can't change any of the lines in the file (:-)) - as that's the customer's data - and want to be able to import all of it - what're the specific moves in the CSV game that are needed in terms of being able to swallow the IMDb data whole? (To allow configurable escape, that is?)

(As a workaround to be unblocked for testing/benchmarking I guess Mehnaz can break the no-changing-lines rule in the very short term - but that's not ideal because we want to talk to the owners of the benchmark she's using and say that we're using exactly their data.)
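
For the short-term workaround, one possible preprocessing step is to re-encode a copy of the file so that backslash-escaped quotes (\") become RFC 4180 doubled quotes (""). The sketch below uses Python's standard csv module and is not part of AsterixDB. The function name reencode_backslash_csv is hypothetical, and it assumes that the backslash is used only as an escape character inside quoted fields and that spaces after delimiters can be dropped.

```python
import csv
import io


def reencode_backslash_csv(src, dst):
    """Copy CSV records from src to dst, converting \\" escapes to "" escapes.

    src and dst are text file objects. This is a sketch under the
    assumptions stated above, not a general-purpose converter.
    """
    # Read with backslash as the escape character instead of doubled quotes.
    reader = csv.reader(
        src, escapechar="\\", doublequote=False, skipinitialspace=True
    )
    # The default writer quotes fields only when needed and doubles any
    # embedded quote characters, which is the RFC 4180 convention.
    writer = csv.writer(dst, lineterminator="\n")
    for record in reader:
        writer.writerow(record)


if __name__ == "__main__":
    sample = io.StringIO(
        '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in '
        '\\"The Best Of Alfred Hitchcock, Vol. One\\")"\n'
    )
    converted = io.StringIO()
    reencode_backslash_csv(sample, converted)
    print(converted.getvalue(), end="")
```

Because this rewrites the lines, it only helps for unblocking tests; it does not satisfy the rule of loading the benchmark data unchanged.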



On 6/11/24 4:13 PM, Ian Maxon wrote:

The problem is sort of multifaceted. DelimitedDataParser doesn't allow
configuration of the escape character. QuotedLineRecordReader does, but it
isn't parsing the fields. You also only get that if you specify
"format"="csv", and not "delimited-text".
The CSV isn't compliant with what's stated in RFC 4180. There, a quote inside
a quoted field is escaped by doubling it (""). This is what
DelimitedDataParser follows. If the line is changed to use that ("" instead
of \"), it works fine.
I think we should consider supporting configurable escape during parse,
since it can't really be expected that CSV should follow that RFC strictly;
it is somewhat of an ad-hoc format.
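
The difference between the two escaping conventions can be illustrated outside AsterixDB with a small Python sketch that uses the standard csv module. This does not show how DelimitedDataParser or QuotedLineRecordReader behave internally; the names parse_rfc4180 and parse_backslash are hypothetical, and the row is written on one line on the assumption that the line break in the quoted message is an email-wrapping artifact.

```python
import csv
import io

# The row from the thread, with backslash-escaped quotes.
BACKSLASH_ROW = (
    '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in '
    '\\"The Best Of Alfred Hitchcock, Vol. One\\")"\n'
)
# The same row rewritten with RFC 4180 doubled quotes.
DOUBLED_ROW = (
    '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in '
    '""The Best Of Alfred Hitchcock, Vol. One"")"\n'
)
EXPECTED_NOTE = (
    '(1986) (USA) (VHS) (included in '
    '"The Best Of Alfred Hitchcock, Vol. One")'
)


def parse_rfc4180(line):
    """Parse one record, treating "" as an escaped quote (RFC 4180 style)."""
    return next(csv.reader(io.StringIO(line), skipinitialspace=True))


def parse_backslash(line):
    """Parse one record, treating backslash as a configurable escape."""
    reader = csv.reader(
        io.StringIO(line),
        escapechar="\\",
        doublequote=False,
        skipinitialspace=True,
    )
    return next(reader)


if __name__ == "__main__":
    # The RFC-style parser sees \" as the end of the quoted field, so the
    # comma inside the note is treated as a field delimiter.
    print(parse_rfc4180(BACKSLASH_ROW))
    # Either rewriting the data or configuring the escape recovers the note.
    print(parse_rfc4180(DOUBLED_ROW))
    print(parse_backslash(BACKSLASH_ROW))
```

The last two calls yield the same five fields, which is the behavior a configurable escape option during parse would be expected to provide without modifying the data.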

On Jun 11, 2024 at 08:30:48, Mike Carey<[email protected]>  wrote:

I’m told the relevant code is in QuotedLineRecordReader, that's where
CSV/TSV parsing takes place, so you can have a look at what is happening
there.  There’s also an undocumented escape flag there (which we need to
test and document).  Others will probably have more details…. 🙂

On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <
[email protected]> wrote:

Hello everyone,

I am trying to load the IMDb dataset in AsterixDB. It seems that some of
the rows end up with broken escaping and eventually are not inserted at
all. For example, I used the syntax as follows:

LOAD DATASET movie_companies using localfs (
("path"=asterix_nc1://imdb-data/movie-companies.csv),
("format"="delimited-text"),("delimiter"=","), ("null"="")
);

The schema is movie_companies (id: int, movie_id: int, company_id: int,
company_type_id: int, note: string) and the CSV file contains the following
row:

13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of
Alfred Hitchcock, Vol. One\")"

This row ends up not loading at all. The rest of the rows, which have no such
string input, load successfully.

Any suggestions?

Thanks,
Mehnaz
