So suppose we make the long-term rule that we can't change any of the
lines in the file (:-)), since that's the customer's data, and that we
want to be able to import all of it. What specific changes to our CSV
handling are needed to swallow the IMDb data whole? (That is, to allow a
configurable escape character?)
(As a workaround to get unblocked for testing/benchmarking, I guess Mehnaz
can break the no-changing-lines rule in the very short term, but that's
not ideal, because we want to be able to tell the owners of the benchmark
she's using that we're using exactly their data.)
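
For the short-term workaround, a minimal sketch of a one-off conversion is shown here. It uses Python's standard csv module, not anything in AsterixDB, and the function name backslash_to_rfc4180 is made up for illustration. It reads rows that escape quotes as \" and writes them back with RFC 4180 doubled quotes (""). Note that it also drops the spaces after the commas, so the output is not byte-identical to the input apart from the escapes.

```python
import csv
import io


def backslash_to_rfc4180(src, dst):
    """Re-encode CSV that escapes quotes as \\" into RFC 4180 style ("")."""
    # Input dialect: backslash is the escape character, quotes are not doubled.
    reader = csv.reader(src, escapechar="\\", doublequote=False,
                        skipinitialspace=True)
    # Output dialect: csv.writer defaults to doubling embedded quotes.
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# The problematic row from the original report, joined onto one line.
sample = ('13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in '
          '\\"The Best Of Alfred Hitchcock, Vol. One\\")"\n')

out = io.StringIO()
backslash_to_rfc4180(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For real files, the same function can be called with two handles opened with newline="" as the csv module documentation recommends.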
On 6/11/24 4:13 PM, Ian Maxon wrote:
The problem is sort of multifaceted. DelimitedDataParser doesn't allow
configuration of the escape character. QuotedLineRecordReader does, but it
isn't parsing the fields. You also only get that if you specify
"format"="csv", and not "delimited-text".
The CSV isn't compliant with what's stated in RFC 4180. There, a double
quote inside a quoted field is escaped by doubling it (""). This is what
DelimitedDataParser follows. If the line is changed to use that (""
instead of \"), it works fine.
I think we should consider supporting configurable escape during parse,
since it can't really be expected that CSV should follow that RFC strictly;
it is somewhat of an ad-hoc format.
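
To illustrate what "configurable escape during parse" would buy us, here is a small sketch using Python's standard csv module as a stand-in. It is not AsterixDB code, and the helper name parse is hypothetical. With RFC 4180 style settings the sample row no longer splits into the five expected fields; once backslash is configured as the escape character it does.

```python
import csv

# The row from the report, without the spaces after the delimiters.
row = ('13893,53192,1376,1,"(1986) (USA) (VHS) (included in '
       '\\"The Best Of Alfred Hitchcock, Vol. One\\")"')


def parse(line, escape=None):
    """Split one CSV line, optionally with a configurable escape character."""
    if escape is None:
        # RFC 4180 style: an embedded quote must be written as "".
        reader = csv.reader([line])
    else:
        # Configurable escape: e.g. backslash, so \" is a literal quote.
        reader = csv.reader([line], escapechar=escape, doublequote=False)
    return next(reader)


strict = parse(row)                 # mis-splits: not the expected 5 fields
lenient = parse(row, escape="\\")   # 5 fields, note column intact

print(len(strict), len(lenient))
print(lenient[4])
```

The design point is that the escape character is a parser parameter rather than a constant, which is the same knob being proposed for DelimitedDataParser.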
On Jun 11, 2024 at 08:30:48, Mike Carey<dtab...@gmail.com> wrote:
I’m told the relevant code is in QuotedLineRecordReader; that's where
CSV/TSV parsing takes place, so you can have a look at what is happening
there. There’s also an undocumented escape flag there (which we need to
test and document). Others will probably have more details... 🙂
On Mon, Jun 10, 2024 at 4:18 PM Mehnaz Tabassum Mahin <
mehnaztabassum.ma...@email.ucr.edu> wrote:
Hello everyone,
I am trying to load the IMDb dataset into AsterixDB. It seems that some of
the rows are parsed with broken escaping and end up not being inserted at
all. For example, I used the following statement:
LOAD DATASET movie_companies using localfs (
("path"="asterix_nc1://imdb-data/movie-companies.csv"),
("format"="delimited-text"),("delimiter"=","), ("null"="")
);
The schema is movie_companies (id: int, movie_id: int, company_id: int,
company_type_id: int, note: string) and the CSV file contains the following
row:
13893, 53192, 1376, 1, "(1986) (USA) (VHS) (included in \"The Best Of
Alfred Hitchcock, Vol. One\")"
This row does not load at all. The other rows, which contain no such
escaped quotes in the string field, load successfully.
Any suggestions?
Thanks,
Mehnaz