There's a patch up on Gerrit now that should allow files like this to be parsed. If you give the option ("escape"="\\"), it is passed all the way down to the parser, which has been extended so that a character other than " can escape a " inside a quoted field. I tried it on the IMDb dataset from that benchmark, and it appeared to parse all the lines, as long as there are no multi-line quoted strings. As a check, I just compared SELECT COUNT(*) ... against 'wc -l'.
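For anyone who wants to see the two escaping conventions side by side outside of AsterixDB, here is a minimal sketch using Python's standard csv module. It is not the AsterixDB parser and says nothing about how the patch is implemented; parse_row and its escape parameter are names made up for this illustration, and the second sample row is just the row from the original report rewritten by hand in the doubled-quote style.

```python
import csv
from typing import List, Optional


def parse_row(line: str, escape: Optional[str] = None) -> List[str]:
    # escape=None: RFC 4180 style, a quote inside a quoted field is doubled ("").
    # escape="\\": backslash style, a quote inside a quoted field is written \".
    reader = csv.reader(
        [line],
        delimiter=",",
        quotechar='"',
        escapechar=escape,
        doublequote=escape is None,  # only treat "" as an escaped quote in RFC mode
        skipinitialspace=True,       # the sample row has a space after each comma
    )
    return next(reader)


# The row from the original report, exactly as it appears in the file.
backslash_row = r'13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of Alfred Hitchcock, Vol. One\")"'

# The same row rewritten by hand to use doubled quotes (illustrative only).
rfc_row = '13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in ""The Best Of Alfred Hitchcock, Vol. One"")"'

print(parse_row(backslash_row, escape="\\"))  # backslash convention
print(parse_row(rfc_row))                     # doubled-quote convention
print(len(parse_row(backslash_row)))          # backslash row, no escape configured
```

With the matching convention, both rows come back as five fields with the same note text. Parsing the backslash-escaped row with escape left unset does not give five fields back, which looks like the same kind of mismatch the loader runs into.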
On Jun 11, 2024 at 17:03:19, Ian Maxon <ima...@apache.org> wrote:
> I believe so. I think FieldCursorForDelimitedDataParser just needs to be
> refactored to allow some character other than quote to begin an escape, and
> it should be able to parse this fine.
> I'd be curious about others' thoughts as well, though. I am surprised we
> haven't hit this yet from other sources.
>
> On Jun 11, 2024 at 16:58:08, Mike Carey <dtab...@gmail.com> wrote:
> > So suppose we make the long-term rule that we can't change any of the
> > lines in the file (:-)) - as that's the customer's data - and want to be
> > able to import all of it - what're the specific moves in the CSV game
> > that are needed in terms of being able to swallow the IMDb data whole?
> > (To allow configurable escape, that is?)
> >
> > (As a workaround to be unblocked for testing/benchmarking I guess Mehnaz
> > can break the no-changing-lines rule in the very short term - but -
> > that's not ideal because we want to talk to the owners of the benchmark
> > she's using and say that we're using exactly their data.)
> >
> > On 6/11/24 4:13 PM, Ian Maxon wrote:
> > > The problem is sort of multifaceted. DelimitedDataParser doesn't allow
> > > configuration of the escape character. QuotedLineRecordReader does, but it
> > > isn't parsing the fields. You also only get that if you specify
> > > "format"="csv", and not "delimited-text".
> > > The CSV isn't compliant with what's stated in RFC 4180. There, a quote
> > > inside a quoted field is escaped as "". This is what DelimitedDataParser
> > > follows. If the line is changed to use that ("" instead of \"), it works
> > > fine.
> > > I think we should consider supporting configurable escape during parse,
> > > since it can't really be expected that CSV should follow that RFC strictly;
> > > it is somewhat of an ad-hoc format.
> > > > On Jun 11, 2024 at 08:30:48, Mike Carey <dtab...@gmail.com> wrote:
> > > > > I’m told the relevant code is in QuotedLineRecordReader, that's where
> > > > > CSV/TSV parsing takes place, so you can have a look at what is happening
> > > > > there. There’s also an undocumented escape flag there (which we need to
> > > > > test and document). Others will probably have more details…. 🙂
> > > > >
> > > > > On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <mehnaztabassum.ma...@email.ucr.edu> wrote:
> > > > > > Hello everyone,
> > > > > >
> > > > > > I am trying to load the IMDb dataset in AsterixDB. It seems that some of
> > > > > > the rows end up with broken escaping and eventually not being inserted at
> > > > > > all. For example, I used the syntax as follows:
> > > > > >
> > > > > > LOAD DATASET movie_companies using localfs (
> > > > > > ("path"=asterix_nc1://imdb-data/movie-companies.csv),
> > > > > > ("format"="delimited-text"),("delimiter"=","), ("null"="")
> > > > > > );
> > > > > >
> > > > > > The schema is movie_companies (id: int, movie_id: int, company_id: int,
> > > > > > company_type_id: int, note: string) and the CSV file contains the following
> > > > > > row:
> > > > > >
> > > > > > 13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of Alfred Hitchcock, Vol. One\")"
> > > > > >
> > > > > > This row ends up not loading at all. The rest of the row with no such
> > > > > > string input can be loaded successfully.
> > > > > >
> > > > > > Any suggestions?
> > > > > >
> > > > > > Thanks,
> > > > > > Mehnaz