trxcllnt commented on pull request #2035:
URL: https://github.com/apache/arrow/pull/2035#issuecomment-695349165


   > @t829702 if JS is not the way to interact with Arrow, then what is the purpose of the JS implementation? Is the JS implementation supposed to be for read-only use?
   
   My comment was about the Arrow JSON IPC _representation_. The Arrow JS library is a valid way to interact with Arrow data and streams.
   
   We have a special JSON IPC representation for running integration tests between the different Arrow language implementations. The `RecordBatchJSONReader` and `RecordBatchJSONWriter` classes read and write this format, but it is designed to be easy for humans to read and write, and would be very inefficient for real-world use.
   
   > can different columns share a single Dictionary (to save some space)
   
   The Arrow IPC format doesn't prohibit this, but the current `DictionaryBuilder` implementation doesn't support it. That would be a good feature request to file in [JIRA](https://issues.apache.org/jira/projects/ARROW).
   
   > is there a better arrow data type for JavaScript Date's JSON string format?
   
   No, that's the correct API for the Date data types. A separate date-parsing utility in Arrow would simply duplicate the (relatively well-optimized) Date implementations provided by the JS VM.
   
   If you want to parse the strings directly into numeric timestamps without materializing Date objects, you could use one of the Arrow `Timestamp` data types instead of a Date type.
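   
   For illustration, a minimal sketch of that approach (the input strings are hypothetical, and this assumes an apache-arrow version whose `Vector.from` accepts `{ type, values }` builder options):
   
   ```ts
   import { Vector, TimestampMillisecond } from 'apache-arrow';
   
   // Hypothetical input: ISO 8601 strings, as produced by Date#toJSON.
   const isoStrings = ['2020-09-18T12:00:00.000Z', '2020-09-19T08:30:00.000Z'];
   
   // Date.parse returns a numeric epoch-milliseconds value directly,
   // so no Date object is materialized per row.
   const timestamps = Vector.from({
     type: new TimestampMillisecond(),
     values: isoStrings.map((s) => Date.parse(s)),
   });
   ```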
   
   > Is there a better way to create RecordBatch than the static method 
`arrow.RecordBatch.new`?
   
   No, this is the recommended way to construct a RecordBatch zero-copy.
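   
   For example, a minimal sketch (the column names and values are hypothetical; assumes an apache-arrow version of roughly this vintage):
   
   ```ts
   import { RecordBatch, Int32Vector, Utf8Vector } from 'apache-arrow';
   
   // Build each column as a Vector, then wrap them in a RecordBatch.
   // RecordBatch.new adopts the vectors' existing buffers rather than copying.
   const ids = Int32Vector.from([1, 2, 3]);
   const labels = Utf8Vector.from(['a', 'b', 'c']);
   
   const batch = RecordBatch.new([ids, labels], ['id', 'label']);
   console.log(batch.length, batch.schema.fields.map((f) => f.name));
   ```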
   
   > it runs in 2.3s to convert a 50MB line-delimited JSON file to a 21MB Arrow file, which is not too bad for a single file. Most of my dataset's source files are less than 100MB, but there is a huge number of them. What's a better (optimized and ergonomic) way to convert them?
   
   You may have success using a newline-delimited JSON parser like [ndjson](https://www.npmjs.com/package/ndjson) rather than `JSON.parse`. You can also pass a custom `dictionaryHashFunction` to the `DictionaryBuilder`; I opted for the ultra-fast [node metrohash](https://www.npmjs.com/package/metrohash) implementation in the [csv-to-arrow example](https://github.com/trxcllnt/csv-to-arrow-js/blob/f2596045474ce1742e3089da48a5c83a6005be90/index.js#L61).
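   
   A rough sketch of that combination (the file path and `city` column are hypothetical; this assumes `metrohash`'s standalone `metrohash64` export and the apache-arrow `Builder.new` options API):
   
   ```ts
   import * as fs from 'fs';
   import * as ndjson from 'ndjson';
   import { metrohash64 } from 'metrohash';
   import { Builder, Dictionary, Utf8, Int32 } from 'apache-arrow';
   
   // Dictionary-encode a string column, hashing values with metrohash64.
   const cityBuilder = Builder.new({
     type: new Dictionary(new Utf8(), new Int32()),
     dictionaryHashFunction: (value: any) => metrohash64(`${value}`),
   });
   
   fs.createReadStream('rows.jsonl')         // hypothetical input file
     .pipe(ndjson.parse())                   // emits one JS object per line
     .on('data', (row: any) => cityBuilder.append(row.city))
     .on('end', () => {
       const cities = cityBuilder.finish().toVector();
       console.log(`dictionary-encoded ${cities.length} rows`);
     });
   ```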
   
   The easiest way to speed this up is to run your script in parallel on as many cores as you can afford, or to use a different language implementation. If you have a GPU available, you can use [cuDF](https://docs.rapids.ai/api/cudf/stable/) to read newline-delimited JSON into a cuDF DataFrame, then serialize the DataFrame to disk as an Arrow table.
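   
   A minimal sketch of the multi-core approach, assuming a hypothetical single-file conversion script `convert.js` that takes an input path:
   
   ```ts
   import * as os from 'os';
   import { execFile } from 'child_process';
   
   // File paths are passed on the command line; convert.js is hypothetical.
   const files = process.argv.slice(2);
   let next = 0;
   
   function runNext(): void {
     if (next >= files.length) return;
     const file = files[next++];
     execFile('node', ['convert.js', file], (err) => {
       if (err) console.error(`failed: ${file}`, err);
       runNext(); // keep this worker slot busy until the queue drains
     });
   }
   
   // Saturate one worker slot per CPU core.
   for (let i = 0; i < os.cpus().length; i++) runNext();
   ```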

