[
https://issues.apache.org/jira/browse/ARROW-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francois Saint-Jacques updated ARROW-8703:
------------------------------------------
Description:
I am trying to export numeric data plus some metadata from Python to a
Parquet file and read it in R. However, the metadata is a dict in Python but
a string in R. I would have expected a list (roughly the R counterpart of a
Python dict). Am I missing something? Here is the code to demonstrate the issue:
{{import sys}}
{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{print(sys.version)}}
{{print(pa.__version__)}}
{{x = np.random.randint(0, 10, (10, 3))}}
{{arrays = [pa.array(x[:, i]) for i in range(x.shape[1])]}}
{{table = pa.Table.from_arrays(arrays=arrays, names=['A', 'B', 'C'],}}
{{ metadata=\{'foo': '42'})}}
{{pq.write_table(table, 'array.parquet', compression='snappy')}}
{{table = pq.read_table('array.parquet')}}
{{metadata = table.schema.metadata}}
{{print(metadata)}}
{{print(type(metadata))}}
And in R:
{{library(arrow)}}
{{print(R.version)}}
{{print(packageVersion("arrow"))}}
{{table <- read_parquet("array.parquet", as_data_frame = FALSE)}}
{{metadata <- table$schema$metadata}}
{{print(metadata)}}
{{print(is(metadata))}}
{{print(metadata["foo"])}}
Output Python:
{{3.6.8 (default, Aug 7 2019, 17:28:10) }}
{{[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]}}
{{0.13.0}}
{{OrderedDict([(b'foo', b'42')])}}
{{<class 'collections.OrderedDict'>}}
Output R:
{{[1] ‘0.17.0’}}
{{[1] "\n-- metadata --\nfoo: 42"}}
{{[1] "character" "vector" "data.frameRowLabels"}}
{{[4] "SuperClassMethod" }}
{{[1] NA}}
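For reference, the Python output above shows the schema metadata as a mapping with bytes keys and bytes values. The following is a minimal sketch, assuming only the mapping shape printed above, of turning it into a plain dict of strings, which is roughly the structure one would expect to get back as a named list in R. The helper name decode_metadata is hypothetical and is not part of pyarrow; the sample mapping is copied from the Python output rather than read from a file.

```python
from collections import OrderedDict


def decode_metadata(metadata, encoding="utf-8"):
    """Return a plain dict with str keys and values.

    `metadata` is assumed to be a mapping of bytes to bytes, as printed
    in the Python output above, or None when no metadata is attached.
    """
    if metadata is None:
        return {}
    return {
        key.decode(encoding): value.decode(encoding)
        for key, value in metadata.items()
    }


# Sample mapping copied from the reported Python output.
raw = OrderedDict([(b"foo", b"42")])
decoded = decode_metadata(raw)
print(decoded["foo"])
```

With a mapping like this, lookup by key works directly, whereas indexing the R string with "foo" returns NA as shown in the R output.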
> [R][Parquet] table$schema$metadata is a string
> ----------------------------------------------
>
> Key: ARROW-8703
> URL: https://issues.apache.org/jira/browse/ARROW-8703
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.17.0
> Reporter: René Rex
> Priority: Critical
>
> I am trying to export numeric data plus some metadata from Python to a
> Parquet file and read it in R. However, the metadata is a dict in Python but
> a string in R. I would have expected a list (roughly the R counterpart of a
> Python dict). Am I missing something? Here is the code to demonstrate the
> issue:
> {{import sys}}
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{print(sys.version)}}
> {{print(pa.__version__)}}
> {{x = np.random.randint(0, 10, (10, 3))}}
> {{arrays = [pa.array(x[:, i]) for i in range(x.shape[1])]}}
> {{table = pa.Table.from_arrays(arrays=arrays, names=['A', 'B', 'C'],}}
> {{ metadata=\{'foo': '42'})}}
> {{pq.write_table(table, 'array.parquet', compression='snappy')}}
> {{table = pq.read_table('array.parquet')}}
> {{metadata = table.schema.metadata}}
> {{print(metadata)}}
> {{print(type(metadata))}}
>
> And in R:
>
> {{library(arrow)}}
> {{print(R.version)}}
> {{print(packageVersion("arrow"))}}
> {{table <- read_parquet("array.parquet", as_data_frame = FALSE)}}
> {{metadata <- table$schema$metadata}}
> {{print(metadata)}}
> {{print(is(metadata))}}
> {{print(metadata["foo"])}}
>
> Output Python:
> {{3.6.8 (default, Aug 7 2019, 17:28:10) }}
> {{[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]}}
> {{0.13.0}}
> {{OrderedDict([(b'foo', b'42')])}}
> {{<class 'collections.OrderedDict'>}}
>
> Output R:
> {{[1] ‘0.17.0’}}
> {{[1] "\n-- metadata --\nfoo: 42"}}
> {{[1] "character" "vector" "data.frameRowLabels"}}
> {{[4] "SuperClassMethod" }}
> {{[1] NA}}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)