Hi Thomas,
You can read CSVs in the browser using the browser's File input APIs and
an appropriate CSV library. The CSV library should be able to parse rows
into JS objects, which can then be passed to the Arrow Struct Builder
for serialization.
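For the row-objects step, here's a minimal sketch in plain JS (no library; the two-column CSV is just an illustration). A real CSV needs a proper parser for quoting and embedded commas/newlines, which is what the CSV library is for:

```javascript
// Naive sketch: turn CSV text into row objects for the Arrow builders.
// A real parser (e.g. one of the npm CSV libraries) must handle quoting,
// escaped delimiters, and embedded newlines — this split does not.
function csvToObjects(text) {
  const [header, ...lines] = text.trim().split('\n');
  const columns = header.split(',');
  return lines.map((line) => {
    const cells = line.split(',');
    // Pair each header column with the matching cell
    return Object.fromEntries(columns.map((c, i) => [c, cells[i] ?? '']));
  });
}

// In the browser the text would come from a File input:
//   const text = await fileInput.files[0].text();
console.log(csvToObjects('name,score\nalice,1.5\nbob,'));
// → [ { name: 'alice', score: '1.5' }, { name: 'bob', score: '' } ]
```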
In this example[1] I'm parsing the first row of the CSV to determine the
schema, constructing an Arrow Builder transform function to parse each
row into a Struct column, then piping all the rows through the transform
stream and constructing Arrow RecordBatches.
The Builders propagate options like the null value representations down
to the child builders, or they can be configured separately by
specifying their own options.
A limitation of the Builders is that the schema must be known up-front.
If the schema needs to change mid-stream, either the current stream
should be terminated and a new one created, or the already-written data
should be re-run through a Builder with the new schema.
Best,
Paul
1. https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js
On 9/28/20 3:19 AM, thomasroshin wrote:
Hello,
I am working on a proof-of-concept for which I am having a bit of
trouble understanding apache-arrow with JS and wanted to clarify a few
things with this regard.
My use case:
I have a MEAN (MongoDB/Express/Angular/NodeJS) app that connects to
customer databases and third-party data and performs analytics and
experiments. I am looking at Apache Arrow from both an interoperability
angle and a performant-analytics angle.
Right now I am working on the analytics side. From the JS front end I
need to be able to read Parquet and big-data CSV files. Please clarify
my understanding:
1. I cannot read Parquet files using the Arrow JS library directly (due
to this issue <https://issues.apache.org/jira/browse/ARROW-2786>). I
have to use something like parquetjs-lite
<https://www.npmjs.com/package/parquetjs-lite> instead.
2. To read a big-data CSV into apache-arrow, I have to first use Python
(pyarrow) to convert the CSV to the Arrow format (as in
using-apache-arrow-js-with-large-datasets
<https://observablehq.com/@theneuralbit/using-apache-arrow-js-with-large-datasets>)
and then read the Arrow file in my JS application.
    a) If (2) above is correct, can I convert any third-party CSV to
Arrow, or do I need a predefined schema ahead of time?
    b) Are nulls and NaNs allowed in the CSV?
If the above understanding is right, this seems a rather roundabout way
(or is it just me). Are there any other paths you can suggest?
regards,
Thomas