trxcllnt commented on pull request #2035:
URL: https://github.com/apache/arrow/pull/2035#issuecomment-695349165


   > @t829702 if JS is not the way to interact with Arrow, then what is the purpose of the JS implementation? Is the JS implementation supposed to be for read-only use?
   
   My comment was about the Arrow JSON IPC _representation_. The Arrow JS library is a valid way to interact with Arrow data and streams.
   
   We have a special JSON IPC representation for running integration tests between the different Arrow language implementations. The `RecordBatchJSONReader` and `RecordBatchJSONWriter` classes read and write this format, but it is designed to be easy for humans to read and write, and would be very inefficient for real-world use.
   
   > can different columns share a single Dictionary (to save some space)
   
   The Arrow IPC format doesn't prohibit this, but the current `DictionaryBuilder` implementation doesn't support it. That would be a good feature request to file in [JIRA](https://issues.apache.org/jira/projects/ARROW).
   
   > is there a better arrow data type for JavaScript Date's JSON string format?
   
   No, that's the correct API for the Date data types. A separate date-parsing utility in Arrow would simply duplicate the (relatively well-optimized) Date implementations provided by the JS VM.
   
   If you want to parse the strings directly into numeric timestamps without materializing Date objects, you could use one of the Arrow `Timestamp` data types instead of a Date type.
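   
   For illustration, a minimal sketch of that approach (the input strings are hypothetical, and this assumes an apache-arrow version whose `Vector.from` accepts `{ type, values }` builder options):
   
   ```ts
   import { Vector, TimestampMillisecond } from 'apache-arrow';
   
   // Hypothetical input: ISO 8601 strings, as produced by Date#toJSON.
   const isoStrings = ['2020-09-18T12:00:00.000Z', '2020-09-19T08:30:00.000Z'];
   
   // Date.parse returns a numeric epoch-milliseconds value directly,
   // so no Date object is materialized per row.
   const timestamps = Vector.from({
     type: new TimestampMillisecond(),
     values: isoStrings.map((s) => Date.parse(s)),
   });
   ```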
   
   > Is there a better way to create RecordBatch than the static method 
`arrow.RecordBatch.new`?
   
   No, this is the recommended way to construct a RecordBatch zero-copy.
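   
   For example, a minimal sketch (the column names and values are hypothetical; assumes an apache-arrow version of roughly this vintage):
   
   ```ts
   import { RecordBatch, Int32Vector, Utf8Vector } from 'apache-arrow';
   
   // Build each column as a Vector, then wrap them in a RecordBatch.
   // RecordBatch.new adopts the vectors' existing buffers rather than copying.
   const ids = Int32Vector.from([1, 2, 3]);
   const labels = Utf8Vector.from(['a', 'b', 'c']);
   
   const batch = RecordBatch.new([ids, labels], ['id', 'label']);
   console.log(batch.length, batch.schema.fields.map((f) => f.name));
   ```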
   
   > it runs in 2.3s to convert a 50MB line-delimited JSON file to a 21MB Arrow file, which is not too bad for a single file. Most of my dataset's source files are less than 100MB, but there is a huge number of them. What's a better (optimized and ergonomic) way to convert them?
   
   You may have success using a newline-delimited JSON parser like [ndjson](https://www.npmjs.com/package/ndjson) rather than `JSON.parse`. You can also pass a custom `dictionaryHashFunction` to the `DictionaryBuilder`; I opted for the ultra-fast [node metrohash](https://www.npmjs.com/package/metrohash) implementation in the [csv-to-arrow example](https://github.com/trxcllnt/csv-to-arrow-js/blob/f2596045474ce1742e3089da48a5c83a6005be90/index.js#L61).
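   
   A rough sketch of that combination (the file path and `city` column are hypothetical; this assumes `metrohash`'s standalone `metrohash64` export and the apache-arrow `Builder.new` options API):
   
   ```ts
   import * as fs from 'fs';
   import * as ndjson from 'ndjson';
   import { metrohash64 } from 'metrohash';
   import { Builder, Dictionary, Utf8, Int32 } from 'apache-arrow';
   
   // Dictionary-encode a string column, hashing values with metrohash64.
   const cityBuilder = Builder.new({
     type: new Dictionary(new Utf8(), new Int32()),
     dictionaryHashFunction: (value: any) => metrohash64(`${value}`),
   });
   
   fs.createReadStream('rows.jsonl')         // hypothetical input file
     .pipe(ndjson.parse())                   // emits one JS object per line
     .on('data', (row: any) => cityBuilder.append(row.city))
     .on('end', () => {
       const cities = cityBuilder.finish().toVector();
       console.log(`dictionary-encoded ${cities.length} rows`);
     });
   ```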
   
   The easiest way to speed this up is to run your script in parallel on as many cores as you can afford, or to use a different language implementation. If you have a GPU available, you can use [cuDF](https://docs.rapids.ai/api/cudf/stable/) to read newline-delimited JSON into a cuDF DataFrame, then serialize the DataFrame to disk as an Arrow table.
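   
   A minimal sketch of the multi-core approach, assuming a hypothetical single-file conversion script `convert.js` that takes an input path:
   
   ```ts
   import * as os from 'os';
   import { execFile } from 'child_process';
   
   // File paths are passed on the command line; convert.js is hypothetical.
   const files = process.argv.slice(2);
   let next = 0;
   
   function runNext(): void {
     if (next >= files.length) return;
     const file = files[next++];
     execFile('node', ['convert.js', file], (err) => {
       if (err) console.error(`failed: ${file}`, err);
       runNext(); // keep this worker slot busy until the queue drains
     });
   }
   
   // Saturate one worker slot per CPU core.
   for (let i = 0; i < os.cpus().length; i++) runNext();
   ```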

