Hi Thomas,

You can read CSVs in the browser using the File input APIs and an appropriate CSV library. The CSV library should be able to parse rows into JS objects, which can then be passed to the Arrow Struct Builder for serialization.

In this example[1] I'm parsing the first row of the CSV to determine the schema, constructing an Arrow Builder transform function to parse each row into a Struct column, then piping all the rows through the transform stream and constructing Arrow RecordBatches.
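
As a rough sketch of the schema-inference step only (plain JavaScript, no Arrow calls): the helper names `inferType`, `inferSchema`, and `parseRow` and the type labels are hypothetical and are not taken from the linked example. It assumes the CSV library has already split each line into an array of strings.

```javascript
// Hypothetical helpers for illustration; they do not call any Arrow API.

// Guess a column type from one sample string value.
function inferType(value) {
  if (value === 'true' || value === 'false') return 'bool';
  if (value.trim() !== '' && Number.isFinite(Number(value))) return 'float64';
  return 'utf8';
}

// Build a { columnName: typeName } schema from the header and first data row.
function inferSchema(header, firstRow) {
  const schema = {};
  header.forEach((name, i) => { schema[name] = inferType(firstRow[i] ?? ''); });
  return schema;
}

// Convert one row of strings into a typed object keyed by column name.
// Empty or missing cells become null.
function parseRow(schema, header, row) {
  const out = {};
  header.forEach((name, i) => {
    const raw = row[i];
    if (raw === undefined || raw === '') { out[name] = null; return; }
    switch (schema[name]) {
      case 'bool': out[name] = raw === 'true'; break;
      case 'float64': out[name] = Number(raw); break;
      default: out[name] = raw;
    }
  });
  return out;
}

const header = ['id', 'name', 'active'];
const rows = [['1', 'Ada', 'true'], ['2', '', 'false']];
const schema = inferSchema(header, rows[0]);
const parsed = rows.map((row) => parseRow(schema, header, row));
```

Objects shaped like the entries of `parsed` are what would then be handed to the Struct Builder. Inferring from a single row is fragile (a column whose first value looks numeric may contain text later), so a real pipeline may want to sample more rows.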

The Builders propagate options like the null value representations down to the child builders, or they can be configured separately by specifying their own options.
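
To illustrate the idea of inherited versus per-child null representations, here is a plain-JavaScript model. It is not Arrow's implementation, and the option shape (`nullValues`, `children`) and the helpers `nullValuesFor` and `normalizeNulls` are assumptions made for this sketch.

```javascript
// Hypothetical option shape; a plain-JS model, not Arrow's implementation.
const options = {
  nullValues: ['', 'NA', 'null'],            // default for every column
  children: {
    score: { nullValues: ['', 'NaN', '-1'] } // override for one column
  }
};

// Use the child's own nullValues if it has any, else inherit the parent's.
function nullValuesFor(opts, column) {
  const child = opts.children && opts.children[column];
  return (child && child.nullValues) || opts.nullValues || [];
}

// Replace every configured null representation in a row with null.
function normalizeNulls(opts, row) {
  const out = {};
  for (const [column, value] of Object.entries(row)) {
    out[column] = nullValuesFor(opts, column).includes(value) ? null : value;
  }
  return out;
}

const cleaned = normalizeNulls(options, {
  name: 'NA', city: 'Oslo', score: '-1', note: 'NaN'
});
```

In this model `name` and `score` become null, while `note` keeps the string 'NaN' because only the `score` column lists it as a null representation.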

A limitation of the Builders is that the schema must be known up-front. If the schema needs to change mid-stream, either the current stream should be terminated and a new one created, or the already-written data should be re-run through a Builder with the new schema.
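
One way to handle the first option is to detect where the set of columns changes and start a new stream at each boundary. A minimal sketch in plain JavaScript follows; `splitBySchema` is a hypothetical helper, and it only compares column names, not value types.

```javascript
// Hypothetical helper: splits rows into consecutive runs that share the
// same set of column names. Each run would get its own Builder/stream.
function splitBySchema(rows) {
  const runs = [];
  let currentKey = null;
  for (const row of rows) {
    // Order-independent signature of the row's column names.
    const key = JSON.stringify(Object.keys(row).sort());
    if (key !== currentKey) {
      runs.push({ columns: Object.keys(row), rows: [] });
      currentKey = key;
    }
    runs[runs.length - 1].rows.push(row);
  }
  return runs;
}

const runs = splitBySchema([
  { id: 1, name: 'Ada' },
  { id: 2, name: 'Grace' },
  { id: 3, name: 'Alan', city: 'London' },
]);
```

Here the first two rows form one run and the third row, which adds a `city` column, starts a second run.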

Best,

Paul

1. https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js

On 9/28/20 3:19 AM, thomasroshin wrote:
Hello,

I am working on a proof-of-concept and am having a bit of trouble
understanding Apache Arrow with JS. I would like to clarify a few
things in this regard.

My use case:
I have a MEAN (MongoDB/Express/Angular/NodeJS) application that connects to
customer databases and third-party data and performs analytics and
experimentation. I am looking at Apache Arrow from an interoperability
angle and a performant-analytics angle.

Right now I am working on the analytics side. From the JS front end I need
to be able to read Parquet and big-data CSV files. Please clarify whether
my understanding is correct:

1. I cannot read a Parquet file using the Arrow libraries directly (due to
this issue: <https://issues.apache.org/jira/browse/ARROW-2786>). I have to
use something like parquetjs-lite
<https://www.npmjs.com/package/parquetjs-lite> for this.
2. To read a big-data CSV into Apache Arrow, I first have to use Python
(pyarrow) to convert the CSV to Arrow format (as in
using-apache-arrow-js-with-large-datasets
<https://observablehq.com/@theneuralbit/using-apache-arrow-js-with-large-datasets>)
and then read the Arrow file in my JS application.
       a) If (2) above is correct, can I convert any third-party CSV to
Arrow, or should I have a predefined schema ahead of time?
       b) Are nulls and NaNs allowed in the CSV?

If my understanding above is right, this seems a rather roundabout way (or
is it just me?). Are there any other paths you can suggest?

regards,
Thomas
