Hi Thomas,
You can read CSVs in the browser using the browser's File input APIs and
an appropriate CSV library. The CSV library should be able to parse rows
into JS objects, which can then be passed to the Arrow Struct Builder
for serialization.
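For the row-objects step, here's a minimal sketch in plain JS (no library; the two-column CSV is just an illustration). A real CSV needs a proper parser for quoting and embedded commas/newlines, which is what the CSV library is for:

```javascript
// Naive sketch: turn CSV text into row objects for the Arrow builders.
// A real parser (e.g. one of the npm CSV libraries) must handle quoting,
// escaped delimiters, and embedded newlines — this split does not.
function csvToObjects(text) {
  const [header, ...lines] = text.trim().split('\n');
  const columns = header.split(',');
  return lines.map((line) => {
    const cells = line.split(',');
    // Pair each header column with the matching cell
    return Object.fromEntries(columns.map((c, i) => [c, cells[i] ?? '']));
  });
}

// In the browser the text would come from a File input:
//   const text = await fileInput.files[0].text();
console.log(csvToObjects('name,score\nalice,1.5\nbob,'));
// → [ { name: 'alice', score: '1.5' }, { name: 'bob', score: '' } ]
```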
In this example[1] I'm parsing the first row of the CSV to determine the
schema, constructing an Arrow Builder transform function to parse each
row into a Struct column, then piping all the rows through the transform
stream and constructing Arrow RecordBatches.
The Builders propagate options like the null value representations down
to the child builders, or they can be configured separately by
specifying their own options.
A limitation of the Builders is that the schema must be known up-front.
If the schema needs to change mid-stream, either the current stream
should be terminated and a new one created, or the already-written data
should be re-run through a Builder with the new schema.
Best,
Paul
1. https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js
On 9/28/20 3:19 AM, thomasroshin wrote:
Hello,
I am working on a proof-of-concept for which I am having a bit of
trouble understanding apache-arrow with JS and wanted to clarify a few
things with this regard.
My use case:
I have a MEAN (MongoDB/Express/Angular/NodeJS) app that connects to
customer databases and third-party data and performs analytics and
experiments. I am looking at Apache Arrow from both an interoperability
angle and a performant-analytics angle.
Right now I am working on the analytics side. From the JS front end I
need to be able to read Parquet and big-data CSV files. Please clarify
my understanding:
1. I cannot read Parquet files using the Arrow JS library directly (due
to this issue <https://issues.apache.org/jira/browse/ARROW-2786>). I
have to use something like parquetjs-lite
<https://www.npmjs.com/package/parquetjs-lite> instead.
2. To read a big-data CSV into apache-arrow, I have to first use Python
(pyarrow) to convert the CSV to the Arrow format (as in
using-apache-arrow-js-with-large-datasets
<https://observablehq.com/@theneuralbit/using-apache-arrow-js-with-large-datasets>)
and then read the Arrow file in my JS application.
    a) If (2) above is correct, can I convert any third-party CSV to
Arrow, or do I need a predefined schema ahead of time?
    b) Are nulls and NaNs allowed in the CSV?
If the above understanding is right, this seems a rather roundabout way
(or is it just me). Are there any other paths you can suggest?
regards,
Thomas