[ https://issues.apache.org/jira/browse/ARROW-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671795#comment-16671795 ]
Brian Hulette commented on ARROW-3667:
--------------------------------------
I'm looking at adding a null column case to the integration tests, but it's not
clear what the JSON format should look like for a null-type column.
I tried generating a JSON file with the C++ implementation to use as a guide,
but it turns out C++ actually fails to read back the JSON it generates from
{{bad.arrow}}:
{code}
-> % ./cpp/build/debug/json-integration-test --integration --mode ARROW_TO_JSON --arrow /tmp/bad.arrow --json /tmp/bad.json
Found schema:
nulls: null
not nulls: string
__index_level_0__: int64
-- metadata --
pandas: {"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "nulls", "field_name": "nulls", "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": "not nulls", "field_name": "not nulls", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "pandas_version": "0.23.4"}
-> % ./cpp/build/debug/json-integration-test --integration --mode JSON_TO_ARROW --arrow /tmp/bad.arrow --json /tmp/bad.json
Found schema:
nulls: null
not nulls: string
__index_level_0__: int64
Error message: Invalid: field VALIDITY not found
{code}
Could someone familiar with the C++ implementation weigh in here? cc
[~wesmckinn] [~pitrou]
Here's what {{/tmp/bad.json}} looks like. Note that the {{nulls}} column entry
carries only {{name}}, {{count}}, and {{children}}, with no {{VALIDITY}} (or any
other buffer) field, which is presumably what {{JSON_TO_ARROW}} is complaining
about:
{code:json}
{
  "schema": {
    "fields": [
      {
        "name": "nulls",
        "nullable": true,
        "type": {
          "name": "null"
        },
        "children": []
      },
      {
        "name": "not nulls",
        "nullable": true,
        "type": {
          "name": "utf8"
        },
        "children": []
      },
      {
        "name": "__index_level_0__",
        "nullable": true,
        "type": {
          "name": "int",
          "bitWidth": 64,
          "isSigned": true
        },
        "children": []
      }
    ]
  },
  "batches": [
    {
      "count": 3,
      "columns": [
        {
          "name": "nulls",
          "count": 3,
          "children": []
        },
        {
          "name": "not nulls",
          "count": 3,
          "VALIDITY": [1, 1, 1],
          "OFFSET": [0, 3, 6, 9],
          "DATA": ["abc", "def", "ghi"],
          "children": []
        },
        {
          "name": "__index_level_0__",
          "count": 3,
          "VALIDITY": [1, 1, 1],
          "DATA": [0, 1, 2],
          "children": []
        }
      ]
    }
  ]
}
{code}
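To make that asymmetry explicit, here's a small sketch (plain Python {{json}}
stdlib, nothing Arrow-specific) that lists the keys each column entry carries;
the {{nulls}} entry is the only one with no buffer fields at all:
{code:python}
import json

# Load the JSON produced by ARROW_TO_JSON and list the keys on each column.
# Expected output (per the file above):
#   nulls ['children', 'count', 'name']
#   not nulls ['DATA', 'OFFSET', 'VALIDITY', 'children', 'count', 'name']
#   __index_level_0__ ['DATA', 'VALIDITY', 'children', 'count', 'name']
with open('/tmp/bad.json') as f:
    doc = json.load(f)

for col in doc['batches'][0]['columns']:
    print(col['name'], sorted(col.keys()))
{code}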
> [JS] Incorrectly reads record batches with an all null column
> -------------------------------------------------------------
>
> Key: ARROW-3667
> URL: https://issues.apache.org/jira/browse/ARROW-3667
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: JS-0.3.1
> Reporter: Brian Hulette
> Priority: Major
> Fix For: JS-0.4.0
>
>
> The JS library seems to incorrectly read any column that comes after an
> all-null column in IPC buffers produced by pyarrow.
> Here's a Python script that generates two Arrow buffers: one with an all-null
> column followed by a UTF-8 column, and a second with those two columns
> reversed:
> {code:python}
> import pyarrow as pa
> import pandas as pd
>
>
> def serialize_to_arrow(df, fd, compress=True):
>     batch = pa.RecordBatch.from_pandas(df)
>     writer = pa.RecordBatchFileWriter(fd, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>
>
> if __name__ == "__main__":
>     df = pd.DataFrame(data={'nulls': [None, None, None],
>                             'not nulls': ['abc', 'def', 'ghi']},
>                       columns=['nulls', 'not nulls'])
>     with open('bad.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
>     df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
>     with open('good.arrow', 'wb') as fd:
>         serialize_to_arrow(df, fd)
> {code}
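> As a sanity check (a minimal sketch, assuming pyarrow's {{pa.ipc.open_file}}
> reader API), pyarrow itself should read {{bad.arrow}} back without trouble,
> which points at the JS read path rather than the writer:
> {code:python}
> import pyarrow as pa
>
> # Read bad.arrow back with the same C++-backed reader that wrote it.
> # The all-null column is expected to round-trip fine here, unlike in JS.
> with open('bad.arrow', 'rb') as fd:
>     table = pa.ipc.open_file(fd).read_all()
>
> print(table.column('not nulls'))  # expect ['abc', 'def', 'ghi']
> {code}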
> JS incorrectly interprets the [null, not null] case:
> {code:javascript}
> > var arrow = require('apache-arrow')
> undefined
> > var fs = require('fs')
> undefined
> > arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
> 'abc'
> > arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
> '\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
> {code}
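> That garbage string is telling: its 16 bytes, decoded as little-endian
> int32s, are exactly the utf8 {{OFFSET}} buffer {{[0, 3, 6, 9]}}. A quick
> check in plain Python (no Arrow involved):
> {code:python}
> import struct
>
> # The 16-character string JS returns for 'not nulls'.get(0), as raw bytes
> # ('\t' is 0x09).
> garbage = b'\x00\x00\x00\x00\x03\x00\x00\x00\x06\x00\x00\x00\x09\x00\x00\x00'
>
> # Four little-endian int32s: exactly the utf8 offsets buffer [0, 3, 6, 9],
> # i.e. the reader handed out the wrong buffer as string data.
> print(struct.unpack('<4i', garbage))  # (0, 3, 6, 9)
> {code}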
> Presumably this is because pyarrow is omitting some (or all) of the buffers
> associated with the all-null column, but the JS IPC reader is still looking
> for them, causing the buffer count to get out of sync.
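> To illustrate the suspected failure mode (a sketch with made-up buffer
> counts, not the readers' actual per-type layouts):
> {code:python}
> # Illustration only, not Arrow APIs. If the writer emits no buffers for the
> # null column but the reader still expects one, every later column's buffers
> # shift by a slot. Hypothetical flattened buffer list for this record batch:
> buffers = ['utf8 VALIDITY', 'utf8 OFFSET', 'utf8 DATA',
>            'int64 VALIDITY', 'int64 DATA']
>
> # Buffer counts a naive reader might expect per column (assumed values):
> expected = [('nulls', 1), ('not nulls', 3), ('__index_level_0__', 2)]
>
> i = 0
> for name, n in expected:
>     print(name, '->', buffers[i:i + n])
>     i += n
> # 'nulls' wrongly consumes 'utf8 VALIDITY', leaving 'not nulls' with
> # ['utf8 OFFSET', 'utf8 DATA', 'int64 VALIDITY'], hence the garbage reads.
> {code}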