Re: Question about loading IMDb dataset from CSV files

Ian Maxon Thu, 13 Jun 2024 12:25:30 -0700

There's a patch up on Gerrit now that should allow files like this to be
parsed. If you give the option ("escape"="\\"), it should be passed all the
way down to the parser, which has been extended to allow characters other
than " to escape a " . I tried it on the IMDb dataset from that benchmark
and it appeared to parse all the lines, as long as there are no multi-line
quoted strings. To check, I just compared SELECT COUNT(*) ... against
'wc -l'.
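
For readers following along, here is a minimal sketch of what a configurable
escape changes. It uses Python's standard csv module, not AsterixDB's parser,
so it only illustrates the idea and says nothing about how the Gerrit patch is
implemented. The sample row is the one from the original report quoted further
down; the helper name and variable names are made up for this example.

```python
import csv

# Row from the original report: embedded quotes are escaped with a backslash.
BACKSLASH_ROW = r'13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of Alfred Hitchcock, Vol. One\")"'
# The same row rewritten the RFC 4180 way: embedded quotes are doubled.
DOUBLED_ROW = '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in ""The Best Of Alfred Hitchcock, Vol. One"")"'


def parse(row, **options):
    """Parse a single CSV line with the given csv.reader options."""
    return next(csv.reader([row], skipinitialspace=True, **options))


# RFC 4180 style input with the default doubled-quote rule.
rfc_fields = parse(DOUBLED_ROW)

# Backslash-escaped input with the default rule: \" is not recognised as an
# escape, so the quoted field ends early and the comma in the note splits it.
misparsed_fields = parse(BACKSLASH_ROW)

# Backslash-escaped input with backslash configured as the escape character.
escaped_fields = parse(BACKSLASH_ROW, escapechar="\\", doublequote=False)

print(len(rfc_fields), len(misparsed_fields), len(escaped_fields))
print(escaped_fields[4])
```

With the escape character configured, the backslash-escaped row should yield
the same five fields as the RFC 4180 version, which is the behaviour the
"escape" option is meant to provide.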


On Jun 11, 2024 at 17:03:19, Ian Maxon <ima...@apache.org> wrote:

> I believe so. I think FieldCursorForDelimitedDataParser just needs to be
> refactored to allow some character other than quote to begin an escape, and
> it should be able to parse this fine.
> I'd be curious about others' thoughts as well, though. I am surprised we
> haven't hit this yet from other sources.
>
> On Jun 11, 2024 at 16:58:08, Mike Carey <dtab...@gmail.com> wrote:
>
> So suppose we make the long-term rule that we can't change any of the
> lines in the file (:-)) - as that's the customer's data - and want to be
> able to import all of it - what are the specific moves in the CSV game
> that are needed in terms of being able to swallow the IMDb data whole?
> (To allow configurable escape, that is?)
>
> (As a workaround to be unblocked for testing/benchmarking I guess Mehnaz
> can break the no-changing-lines rule in the very short term - but -
> that's not ideal because we want to talk to the owners of the benchmark
> she's using and say that we're using exactly their data.)
>
> On 6/11/24 4:13 PM, Ian Maxon wrote:
>
> The problem is sort of multifaceted. DelimitedDataParser doesn't allow
> configuration of the escape character. QuotedLineRecordReader does, but it
> isn't parsing the fields. You also only get that if you specify
> "format"="csv", and not "delimited-text".
> The CSV isn't compliant with what's stated in RFC 4180. There, a quote
> inside a quoted field is escaped by doubling it (""). This is what
> DelimitedDataParser follows. If the line is changed to use that
> ("" instead of \"), it works fine.
> I think we should consider supporting a configurable escape during parsing,
> since it can't really be expected that CSV should follow that RFC strictly;
> it is somewhat of an ad hoc format.
>
> On Jun 11, 2024 at 08:30:48, Mike Carey<dtab...@gmail.com>  wrote:
>
> > I’m told the relevant code is in QuotedLineRecordReader, that's where
> > CSV/TSV parsing takes place, so you can have a look at what is happening
> > there. There’s also an undocumented escape flag there (which we need to
> > test and document). Others will probably have more details…. 🙂
> >
> > On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <
> > mehnaztabassum.ma...@email.ucr.edu> wrote:
> >
> > Hello everyone,
> >
> > I am trying to load the IMDb dataset in AsterixDB. It seems that some of
> > the rows end up with broken escaping and eventually not being inserted at
> > all. For example, I used the syntax as follows:
> >
> > LOAD DATASET movie_companies using localfs (
> > ("path"="asterix_nc1://imdb-data/movie-companies.csv"),
> > ("format"="delimited-text"),("delimiter"=","), ("null"="")
> > );
> >
> > The schema is movie_companies (id: int, movie_id: int, company_id: int,
> > company_type_id: int, note: string) and the CSV file contains the
> > following row:
> >
> > 13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of Alfred Hitchcock, Vol. One\")"
> >
> > This row ends up not loading at all. The rest of the rows, which have no
> > such string input, can be loaded successfully.
> >
> > Any suggestions?
> >
> > Thanks,
> > Mehnaz
>
