t829702 edited a comment on pull request #2035:
URL: https://github.com/apache/arrow/pull/2035#issuecomment-696480501


   > Providing a separate utility in Arrow to parse dates
   
   I didn't mean to duplicate the JS parsing code, but rather a way to pass a 
custom parser function to the constructor, something like 
`new arrow.DateMillisecond( str => Date.parse(str) )`, because my date strings 
are already in `Date.toISOString` format; if somebody else's date string 
representation is different, they could pass in something like 
`d3.timeParse("...")`.
   Then one could emit a stream of objects like 
`{"date":"2020-09-18T21:42:30.324Z", ...}` directly to 
`arrow.Builder.throughAsyncIterable(...)` (a workaround sketch follows below).
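   
   Since that parser parameter doesn't exist in apache-arrow today, here is a 
minimal workaround sketch in its place, assuming `Builder.throughAsyncIterable` 
and a `DateMillisecond` builder that accepts JS `Date` values; the `rows` 
iterable and the `parseDates` helper are hypothetical names:
   
   ```js
   const { Builder, DateMillisecond } = require('apache-arrow');
   
   // Hypothetical helper: parse the date strings *before* they reach the
   // builder, i.e. what the proposed parser-in-constructor would do internally.
   async function* parseDates(rows) {
     for await (const row of rows) {
       // row is e.g. { date: "2020-09-18T21:42:30.324Z", ... }
       yield row.date == null ? null : new Date(row.date);
     }
   }
   
   const toDateVectors = Builder.throughAsyncIterable({
     type: new DateMillisecond(),
     nullValues: [null, undefined],
   });
   
   // for await (const vector of toDateVectors(parseDates(rows))) { /* ... */ }
   ```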
   
   I have tried the Python implementation's JSON package. It has a form of 
_Automatic Type Inference_, but it recognizes only two time string formats, 
`"YYYY-MM-DD"` and `"YYYY-MM-DD hh:mm:ss"`; all other values are left as 
strings (Utf8), which is no good, and I don't see any lower-level way exposed 
to tell it the types. I tried passing in an explicit schema, but it did not 
work as expected.
   Also, my sender and receiver fields were all left as strings, which didn't 
help when trying a dictionary at all.
   https://arrow.apache.org/docs/python/json.html#automatic-type-inference
   
   Would it be an advanced usage if each DataType could take an optional parser?
   1. `new arrow.DateMillisecond( str => Date.parse(str) )` for the JS-standard 
`"2020-09-18T21:42:30.324Z"` time string
   2. `new arrow.DateMillisecond( d3.timeParse("%Y-%m-%d %H:%M:%S") )` for 
others like `"2020-09-18 21:42:30"`
   
   The Rust implementation's JSON package looks like it has a nicer 
`arrow::json::reader::infer_json_schema`, which I haven't tried yet:
   https://docs.rs/arrow/1.0.1/arrow/json/reader/fn.infer_json_schema.html
   
   When I said a `50MB line-json file`, it is already in `ndjson` format: each 
line is a compact and valid JSON document, but the whole file is not. I have 
read its code (https://github.com/ndjson/ndjson.js/blob/master/index.js#L17); 
underneath it is the same as calling `JSON.parse` on each line after reading 
it, so it shouldn't be much faster, but it can save some LOC (sketched below).
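   
   A minimal sketch of that per-line approach, assuming plain Node.js (`fs` 
and `readline` only, no ndjson.js); the `readNdjson` name is made up here:
   
   ```js
   const fs = require('fs');
   const readline = require('readline');
   
   // Yield one parsed object per non-empty line -- effectively what
   // ndjson.js does, just without the extra dependency.
   async function* readNdjson(path) {
     const lines = readline.createInterface({ input: fs.createReadStream(path) });
     for await (const line of lines) {
       if (line.trim().length > 0) yield JSON.parse(line);
     }
   }
   ```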
   
   I have read your csv-to-arrow; thanks again for answering all the 
questions, but one more: why not ship it inside the apache-arrow NPM package? 
All the other implementations package one or more helper utilities, which is 
really helpful when working with binary Arrow files on the command line, and 
even shell scripts can then do a lot of parallel work in cron jobs...
   1. One major feature would be schema inference, equivalent to the Python 
JSON reader's _Automatic Type Inference_: 
https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js#L51
   2. And we could also have a `json-to-arrow` helper with smart schema 
inference.
   
   It would be nice if the `infer schema` step (or _Automatic Type Inference_) 
could (a rough sketch follows this list):
   1. recognize some more of the popular time string formats, something like 
[`d3.autoType`](https://github.com/d3/d3-dsv/blob/master/src/autoType.js#L9-L11) 
does
   2. detect each number column's range and precision and use only the minimum 
width that covers all values (say, if Int32 covers all values, don't use 
Int64; and if Float32 covers all, don't use Float64)
   3. try its best to use a dictionary if a string column has low cardinality 
(say, fewer distinct values than 10% of the total number of rows)
   4. anything else?
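   
   A rough, non-authoritative sketch of rules 1-3 above; `inferColumnType` and 
its string type tags are invented here for illustration and are not part of 
any Arrow package:
   
   ```js
   // Given a sampled column of already-JSON.parse'd values, pick a type tag.
   function inferColumnType(values) {
     // Rule 1 (simplified): a couple of popular ISO-ish date/time shapes.
     const DATE_LIKE = /^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?)?$/;
   
     if (values.every(v => typeof v === 'number')) {
       if (values.every(Number.isInteger)) {
         // Rule 2: smallest integer width that covers the observed range.
         const maxAbs = values.reduce((m, v) => Math.max(m, Math.abs(v)), 0);
         return maxAbs <= 0x7fffffff ? 'Int32' : 'Int64';
       }
       // Rule 2: Float32 only if every value is exactly representable in it.
       return values.every(v => Math.fround(v) === v) ? 'Float32' : 'Float64';
     }
     if (values.every(v => typeof v === 'string')) {
       if (values.every(v => DATE_LIKE.test(v))) return 'DateMillisecond';
       // Rule 3: dictionary-encode low-cardinality string columns.
       const distinct = new Set(values).size;
       return distinct < 0.1 * values.length ? 'Dictionary<Utf8>' : 'Utf8';
     }
     return 'Utf8'; // fall back to strings, like pyarrow's reader does
   }
   ```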

