[ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671795#comment-16671795 ]
Brian Hulette commented on ARROW-3667:
--------------------------------------
I'm looking at adding a null column case to the integration tests, but it's not
clear what the JSON format should look like for a null-type column.
I tried generating a JSON file with the C++ implementation to use as a guide,
but it turns out C++ actually fails to read back the JSON it generates from
{{bad.arrow}}:
{code}
-> % ./cpp/build/debug/json-integration-test --integration --mode ARROW_TO_JSON --arrow /tmp/bad.arrow --json /tmp/bad.json
Found schema:
nulls: null
not nulls: string
__index_level_0__: int64
-- metadata --
pandas: {"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "nulls", "field_name": "nulls", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "not nulls", "field_name": "not nulls", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "pandas_version": "0.23.4"}
-> % ./cpp/build/debug/json-integration-test --integration --mode JSON_TO_ARROW --arrow /tmp/bad.arrow --json /tmp/bad.json
Found schema:
nulls: null
not nulls: string
__index_level_0__: int64
Error message: Invalid: field VALIDITY not found
{code}
Could someone familiar with the C++ implementation weigh in here? cc
[~wesmckinn] [~pitrou]
Here's what {{/tmp/bad.json}} looks like. Note that the {{nulls}} column entry
carries only {{name}}, {{count}}, and {{children}}, with no {{VALIDITY}} (or any
other buffer) field, which is presumably what {{JSON_TO_ARROW}} is complaining
about:
{code:json}
{
  "schema": {
    "fields": [
      {
        "name": "nulls",
        "nullable": true,
        "type": {
          "name": "null"
        },
        "children": []
      },
      {
        "name": "not nulls",
        "nullable": true,
        "type": {
          "name": "utf8"
        },
        "children": []
      },
      {
        "name": "__index_level_0__",
        "nullable": true,
        "type": {
          "name": "int",
          "bitWidth": 64,
          "isSigned": true
        },
        "children": []
      }
    ]
  },
  "batches": [
    {
      "count": 3,
      "columns": [
        {
          "name": "nulls",
          "count": 3,
          "children": []
        },
        {
          "name": "not nulls",
          "count": 3,
          "VALIDITY": [1, 1, 1],
          "OFFSET": [0, 3, 6, 9],
          "DATA": ["abc", "def", "ghi"],
          "children": []
        },
        {
          "name": "__index_level_0__",
          "count": 3,
          "VALIDITY": [1, 1, 1],
          "DATA": [0, 1, 2],
          "children": []
        }
      ]
    }
  ]
}
{code}
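To make that asymmetry explicit, here's a small sketch (plain Python {{json}}
stdlib, nothing Arrow-specific) that lists the keys each column entry carries;
the {{nulls}} entry is the only one with no buffer fields at all:
{code:python}
import json

# Load the JSON produced by ARROW_TO_JSON and list the keys on each column.
# Expected output (per the file above):
#   nulls ['children', 'count', 'name']
#   not nulls ['DATA', 'OFFSET', 'VALIDITY', 'children', 'count', 'name']
#   __index_level_0__ ['DATA', 'VALIDITY', 'children', 'count', 'name']
with open('/tmp/bad.json') as f:
    doc = json.load(f)

for col in doc['batches'][0]['columns']:
    print(col['name'], sorted(col.keys()))
{code}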
> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
> Key: ARROW-3667
> URL: https://issues.apache.org/jira/browse/ARROW-3667
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: JS-0.3.1
> Reporter: Brian Hulette
> Priority: Major
> Fix For: JS-0.4.0
>
>
> The JS library seems to incorrectly read any column that comes after an
> all-null column in IPC buffers produced by pyarrow.
> Here's a Python script that generates two Arrow buffers: one with an all-null
> column followed by a UTF-8 column, and a second with those two columns
> reversed:
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
>
> def serialize_to_arrow(df, fd, compress=True):
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
>
> if __name__ == "__main__":
>     df = pd.DataFrame(data={'nulls': [None, None, None],
>                             'not nulls': ['abc', 'def', 'ghi']},
>                       columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
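> As a sanity check (a minimal sketch, assuming pyarrow's {{pa.ipc.open_file}}
> reader API), pyarrow itself should read {{bad.arrow}} back without trouble,
> which points at the JS read path rather than the writer:
> {code:python}
> import pyarrow as pa
>
> # Read bad.arrow back with the same C++-backed reader that wrote it.
> # The all-null column is expected to round-trip fine here, unlike in JS.
> with open('bad.arrow', 'rb') as fd:
>     table = pa.ipc.open_file(fd).read_all()
>
> print(table.column('not nulls'))  # expect ['abc', 'def', 'ghi']
> {code}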
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
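> That garbage string is telling: its 16 bytes, decoded as little-endian
> int32s, are exactly the utf8 {{OFFSET}} buffer {{[0, 3, 6, 9]}}. A quick
> check in plain Python (no Arrow involved):
> {code:python}
> import struct
>
> # The 16-character string JS returns for 'not nulls'.get(0), as raw bytes
> # ('\t' is 0x09).
> garbage = b'\x00\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\x09\x00\x00\x00'
>
> # Four little-endian int32s: exactly the utf8 offsets buffer [0, 3, 6, 9],
> # i.e. the reader handed out the wrong buffer as string data.
> print(struct.unpack('<4i', garbage))  # (0, 3, 6, 9)
> {code}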
> Presumably this is because pyarrow is omitting some (or all) of the buffers
> associated with the all-null column, but the JS IPC reader is still looking
> for them, causing the buffer count to get out of sync.
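> To illustrate the suspected failure mode (a sketch with made-up buffer
> counts, not the readers' actual per-type layouts):
> {code:python}
> # Illustration only, not Arrow APIs. If the writer emits no buffers for the
> # null column but the reader still expects one, every later column's buffers
> # shift by a slot. Hypothetical flattened buffer list for this record batch:
> buffers = ['utf8 VALIDITY', 'utf8 OFFSET', 'utf8 DATA',
>            'int64 VALIDITY', 'int64 DATA']
>
> # Buffer counts a naive reader might expect per column (assumed values):
> expected = [('nulls', 1), ('not nulls', 3), ('__index_level_0__', 2)]
>
> i = 0
> for name, n in expected:
>     print(name, '->', buffers[i:i + n])
>     i += n
> # 'nulls' wrongly consumes 'utf8 VALIDITY', leaving 'not nulls' with
> # ['utf8 OFFSET', 'utf8 DATA', 'int64 VALIDITY'], hence the garbage reads.
> {code}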