Re: Question about loading IMDb dataset from CSV files

Ian Maxon Tue, 11 Jun 2024 17:03:25 -0700

I believe so. I think FieldCursorForDelimitedDataParser just needs to be
refactored to allow some character other than quote to begin an escape, and
it should be able to parse this fine.
I'd be curious about others' thoughts as well, though. I am surprised we
haven't hit this yet from other sources.
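
To make the difference between the two escaping styles concrete, here is a
minimal sketch using Python's standard csv module as a stand-in. It is not
AsterixDB code and says nothing about how FieldCursorForDelimitedDataParser
works internally; the helper names (parse_backslash_escaped, parse_rfc4180,
to_rfc4180) are made up for illustration. It parses the row from the original
message with a backslash escape, shows that an RFC4180-style parse splits the
row differently, and rewrites the row with doubled quotes, which is roughly
the short-term workaround discussed in this thread.

```python
import csv
import io

# The row from the original message, joined onto one line. The note field
# escapes embedded quotes with a backslash (\") instead of doubling them.
ROW = (
    '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in '
    '\\"The Best Of Alfred Hitchcock, Vol. One\\")"'
)


def parse_backslash_escaped(line):
    """Parse one line, treating backslash as the escape character."""
    reader = csv.reader(
        [line],
        skipinitialspace=True,  # the sample row has a space after each comma
        doublequote=False,      # "" is not treated as an escaped quote
        escapechar="\\",        # \" is treated as an escaped quote
    )
    return next(reader)


def parse_rfc4180(line):
    """Parse one line the RFC4180 way: a quote is escaped by doubling it."""
    reader = csv.reader([line], skipinitialspace=True)
    return next(reader)


def to_rfc4180(line):
    """Rewrite a backslash-escaped line so that quotes are doubled instead."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(parse_backslash_escaped(line))
    return out.getvalue().rstrip("\n")


fields = parse_backslash_escaped(ROW)
converted = to_rfc4180(ROW)

print(len(fields), "fields with a backslash escape; note =", fields[4])
print(len(parse_rfc4180(ROW)), "fields when the same line is read as RFC4180")
print(converted)
```

Note that the rewritten line also drops the spaces after the commas, so it is
not byte-for-byte the customer's data, which is why this only helps as a
short-term workaround.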


On Jun 11, 2024 at 16:58:08, Mike Carey <dtab...@gmail.com> wrote:

> So suppose we make the long-term rule that we can't change any of the
> lines in the file (:-)) - as that's the customer's data - and want to be
> able to import all of it - what're the specific moves in the CSV game that are
> needed in terms of being able to swallow the IMDb data whole?  (To allow
> configurable escape, that is?)
>
> (As a workaround to be unblocked for testing/benchmarking I guess Mehnaz
> can break the no-changing lines rule in the very short term - but -
> that's not ideal because we want to talk to the owners of the benchmark
> she's using and say that we're using exactly their data.)
>
>
> On 6/11/24 4:13 PM, Ian Maxon wrote:
>
> The problem is sort of multifaceted. DelimitedDataParser doesn't allow
> configuration of the escape character. QuotedLineRecordReader does, but it
> isn't parsing the fields. You also only get that if you specify
> "format"="csv", and not "delimited-text".
> The csv isn't compliant with what's stated in RFC4180. There, the escape
> character is "". This is what DelimitedDataParser follows. If the line is
> changed to use that ("" instead of \"), it works fine.
> I think we should consider supporting configurable escape during parse,
> since it can't really be expected that CSV should follow that RFC strictly;
> it is somewhat of an ad-hoc format.
>
>
> On Jun 11, 2024 at 08:30:48, Mike Carey <dtab...@gmail.com> wrote:
>
> > I’m told the relevant code is in QuotedLineRecordReader, that's where
> > CSV/TSV parsing takes place, so you can have a look at what is happening
> > there.  There’s also an undocumented escape flag there (which we need to
> > test and document).  Others will probably have more details…. 🙂
> >
> > On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <
> > mehnaztabassum.ma...@email.ucr.edu> wrote:
> >
> > Hello everyone,
> >
> > I am trying to load the IMDb dataset in AsterixDB. It seems that some of
> > the rows end up with broken escaping and eventually not being inserted at
> > all. For example, I used the syntax as follows:
> >
> > LOAD DATASET movie_companies using localfs (
> > ("path"=asterix_nc1://imdb-data/movie-companies.csv),
> > ("format"="delimited-text"),("delimiter"=","), ("null"="")
> > );
> >
> > The schema is movie_companies (id: int, movie_id: int, company_id: int,
> > company_type_id: int, note: string) and the CSV file contains the following
> > row:
> >
> > 13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of
> > Alfred Hitchcock, Vol. One\")"
> >
> > This row ends up not loading at all. The rest of the rows with no such
> > string input can be loaded successfully.
> >
> > Any suggestions?
> >
> > Thanks,
> > Mehnaz
