[jira] [Created] (ARROW-3776) [Rust] Mark methods that do not perform bounds checking as unsafe
Paddy Horan created ARROW-3776: -- Summary: [Rust] Mark methods that do not perform bounds checking as unsafe Key: ARROW-3776 URL: https://issues.apache.org/jira/browse/ARROW-3776 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Paddy Horan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet
[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3238. - Resolution: Not A Problem I don't believe there is anything we can fix here > [Python] Can't read pyarrow string columns in fastparquet > - > > Key: ARROW-3238 > URL: https://issues.apache.org/jira/browse/ARROW-3238 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Theo Walker >Priority: Major > Labels: parquet > > Writing really long strings from pyarrow causes exception in fastparquet read. > {code:java} > Traceback (most recent call last): > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in > read_fastparquet() > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in > read_fastparquet > dff = pf.to_pandas(['A']) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 426, in to_pandas > index=index, assign=parts) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 258, in read_row_group > scheme=self.file_scheme) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 344, in read_row_group > cats, selfmade, assign=assign) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 321, in read_row_group_arrays > catdef=out.get(name+'-catdef', None)) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 235, in read_col > skip_nulls, selfmade=selfmade) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 99, in read_data_page > raw_bytes = _read_page(f, header, metadata) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 31, in _read_page > page_header.uncompressed_page_size) > AssertionError: found 175532 raw bytes (expected 200026){code} > If written with compression, it reports compression errors instead: 
> {code:java} > SNAPPY: snappy.UncompressError: Error while decompressing: invalid input > GZIP: zlib.error: Error -3 while decompressing data: incorrect header > check{code} > > > Minimal code to reproduce: > {code:java} > import os > import pandas as pd > import pyarrow > import pyarrow.parquet as arrow_pq > from fastparquet import ParquetFile > # data to generate > ROW_LENGTH = 4 # decreasing below 32750ish eliminates exception > N_ROWS = 10 > # file write params > ROW_GROUP_SIZE = 5 # Lower numbers eliminate exception, but strange data is > read (e.g. Nones) > FILENAME = 'test.parquet' > def write_arrow(): > df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]}) > if os.path.isfile(FILENAME): > os.remove(FILENAME) > arrow_table = pyarrow.Table.from_pandas(df) > arrow_pq.write_table(arrow_table, > FILENAME, > use_dictionary=False, > compression='NONE', > row_group_size=ROW_GROUP_SIZE) > def read_arrow(): > print "arrow:" > table2 = arrow_pq.read_table(FILENAME) > print table2.to_pandas().head() > def read_fastparquet(): > print "fastparquet:" > pf = ParquetFile(FILENAME) > dff = pf.to_pandas(['A']) > print dff.head() > write_arrow() > read_arrow() > read_fastparquet() > {code} > Versions: > {code:java} > fastparquet==0.1.6 > pyarrow==0.10.0 > pandas==0.22.0 > sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, > 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code} > Also opened issue here: https://github.com/dask/fastparquet/issues/375 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contiguous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contiguous arrays (was: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > contiguous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3766) [Python] pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3766: Summary: [Python] pa.Table.from_pandas doesn't use schema ordering (was: pa.Table.from_pandas doesn't use schema ordering) > [Python] pa.Table.from_pandas doesn't use schema ordering > - > > Key: ARROW-3766 > URL: https://issues.apache.org/jira/browse/ARROW-3766 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Christian Thiel >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Pyarrow is sensitive to the order of the columns upon load of partitioned > Files. > With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we > can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} > object does use the ordering of pandas columns rather than the schema > columns. Furthermore it is possible to have columns in the schema but not in > the DataFrame (and hence in the resulting pa.Table). > This behaviour requires a lot of fiddling with the pandas Frame in the first > place if we like to write compatible partitioned files. Hence I argue that > for {{pa.Table.from_pandas}}, and any other comparable function, the schema > should be the principal source for the Table structure and not the columns > and the ordering in the pandas DataFrame. If I specify a schema I simply > expect that the resulting Table actually has this schema. > Here is a little example. 
If you remove the reordering of df2 everything > works fine: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > import os > import numpy as np > import shutil > PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/' > if os.path.exists(PATH_PYARROW_MANUAL): > shutil.rmtree(PATH_PYARROW_MANUAL) > os.mkdir(PATH_PYARROW_MANUAL) > arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan]) > strings = np.array([np.nan, np.nan, 'a', 'b']) > df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column']) > df.index.name='DPRD_ID' > df['arrays'] = pd.Series(arrays) > df['strings'] = pd.Series(strings) > my_schema = pa.schema([('DPRD_ID', pa.int64()), >('partition_column', pa.int32()), >('arrays', pa.list_(pa.int32())), >('strings', pa.string()), >('new_column', pa.string())]) > df1 = df[df.partition_column==0] > df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']] > table1 = pa.Table.from_pandas(df1, schema=my_schema) > table2 = pa.Table.from_pandas(df2, schema=my_schema) > pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa')) > pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa')) > pd.read_parquet(PATH_PYARROW_MANUAL) > {code} > If -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1186 to ARROW-3775: -- Fix Version/s: (was: cpp-1.5.0) 0.12.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3775 (was: PARQUET-1186) Project: Apache Arrow (was: Parquet) > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3775: Labels: parquet (was: ) > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3771) [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to specify chunksize
[ https://issues.apache.org/jira/browse/ARROW-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3771: Labels: parquet (was: ) > [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to > specify chunksize > > > Key: ARROW-3771 > URL: https://issues.apache.org/jira/browse/ARROW-3771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Minor > Labels: parquet > > see [https://github.com/apache/parquet-cpp/pull/445] comments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684428#comment-16684428 ] Wes McKinney commented on ARROW-3775: - Moved to Arrow. I think this might be a duplicate issue > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3771) [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to specify chunksize
[ https://issues.apache.org/jira/browse/ARROW-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1257 to ARROW-3771: -- Component/s: (was: parquet-cpp) C++ External issue ID: (was: ARROW-2360) Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3771 (was: PARQUET-1257) Project: Apache Arrow (was: Parquet) > [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to > specify chunksize > > > Key: ARROW-3771 > URL: https://issues.apache.org/jira/browse/ARROW-3771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Minor > Labels: parquet > > see [https://github.com/apache/parquet-cpp/pull/445] comments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays (was: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous > arrays > > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Labels: parquet (was: ) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > continuous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1393 to ARROW-3774: -- Fix Version/s: (was: cpp-1.6.0) Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Issue Type: Improvement (was: New Feature) Key: ARROW-3774 (was: PARQUET-1393) Project: Apache Arrow (was: Parquet) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > continuous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3773) [C++] Remove duplicated AssertArraysEqual code in parquet/arrow/arrow-reader-writer-test.cc
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3773: Summary: [C++] Remove duplicated AssertArraysEqual code in parquet/arrow/arrow-reader-writer-test.cc (was: [C++] Fix AssertArraysEqual call) > [C++] Remove duplicated AssertArraysEqual code in > parquet/arrow/arrow-reader-writer-test.cc > --- > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3773) [C++] Fix AssertArraysEqual call
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3773: Labels: parquet (was: ) > [C++] Fix AssertArraysEqual call > > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3773) [C++] Fix AssertArraysEqual call
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1127 to ARROW-3773: -- Fix Version/s: (was: cpp-1.5.0) 0.12.0 Affects Version/s: (was: cpp-1.2.0) Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Issue Type: Improvement (was: Bug) Key: ARROW-3773 (was: PARQUET-1127) Project: Apache Arrow (was: Parquet) > [C++] Fix AssertArraysEqual call > > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684422#comment-16684422 ] Wes McKinney commented on ARROW-3772: - Moved issue to Arrow issue tracker > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1324 to ARROW-3772: -- Fix Version/s: (was: cpp-1.6.0) 0.13.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3772 (was: PARQUET-1324) Project: Apache Arrow (was: Parquet) > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3772: Labels: parquet (was: ) > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3770: Component/s: C++ > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1362 to ARROW-3770: -- Fix Version/s: (was: cpp-1.6.0) Component/s: (was: parquet-cpp) Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3770 (was: PARQUET-1362) Project: Apache Arrow (was: Parquet) > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3770: Labels: parquet (was: ) > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684401#comment-16684401 ] Wes McKinney commented on ARROW-3769: - Moved this here from the Parquet JIRA > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1423 to ARROW-3769: -- Fix Version/s: (was: cpp-1.6.0) 0.13.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3769 (was: PARQUET-1423) Project: Apache Arrow (was: Parquet) > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3769: Labels: parquet (was: ) > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684214#comment-16684214 ] Pindikura Ravindra commented on ARROW-3738: --- I'm fine with moving date.h to arrow/util, [~wesmckinn] > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684184#comment-16684184 ] Wes McKinney commented on ARROW-3738: - It looks like this has already happened in https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/date.h. I suggest we move {{date.h}} to {{arrow/util}} [~pravindra] sound ok? > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684181#comment-16684181 ] Antoine Pitrou commented on ARROW-3738: --- To keep things simple, I suggest we start using a date library. The following looks good (and is actually the basis for the future C++20 API): https://github.com/HowardHinnant/date We could simply vendor the {{date.h}} file. > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3768) [Python] set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3768: Summary: [Python] set classpath to hdfs not hadoop executable (was: set classpath to hdfs not hadoop executable) > [Python] set classpath to hdfs not hadoop executable > > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3768) [Python] set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684003#comment-16684003 ] Wes McKinney commented on ARROW-3768: - The issue will be resolved/closed once a patch is merged > [Python] set classpath to hdfs not hadoop executable > > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (ARROW-3768) set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-3768: - > set classpath to hdfs not hadoop executable > --- > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3768) set classpath to hdfs not hadoop executable
Andrew Harris created ARROW-3768: Summary: set classpath to hdfs not hadoop executable Key: ARROW-3768 URL: https://issues.apache.org/jira/browse/ARROW-3768 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Andrew Harris The documentation for connecting to hdfs from pyarrow shows using the `hdfs` executable for setting the CLASSPATH. However in `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My understanding is that we will want `_maybe_set_hadoop_classpath` to set CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
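The change under discussion can be sketched in Python. This mirrors the documented approach of asking the `hdfs` executable for the classpath; `hdfs_classpath` and its None fallback are illustrative, not pyarrow's actual `_maybe_set_hadoop_classpath`:

```python
import os
import shutil
import subprocess

def hdfs_classpath(executable="hdfs"):
    """Query the given executable for the Hadoop CLASSPATH.

    Returns the classpath string, or None when the executable is
    not on PATH (hypothetical fallback, for illustration only).
    """
    if shutil.which(executable) is None:
        return None
    out = subprocess.run([executable, "classpath", "--glob"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# As documented, prefer the `hdfs` executable over `hadoop`:
cp = hdfs_classpath("hdfs")
if cp is not None:
    os.environ["CLASSPATH"] = cp
```

On a machine without an HDFS install the helper simply returns None instead of raising, which is one reasonable shape for a "maybe set" function.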
[jira] [Updated] (ARROW-3439) [R] R language bindings for Feather format
[ https://issues.apache.org/jira/browse/ARROW-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3439: -- Labels: pull-request-available (was: ) > [R] R language bindings for Feather format > -- > > Key: ARROW-3439 > URL: https://issues.apache.org/jira/browse/ARROW-3439 > Project: Apache Arrow > Issue Type: New Feature > Components: R > Reporter: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > This will enable work on a "Feather v2" to commence so that the codebase > in github.com/wesm/feather can finally be deprecated -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3768) set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Harris closed ARROW-3768. Resolution: Fixed > set classpath to hdfs not hadoop executable > --- > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3766: Fix Version/s: 0.12.0 > pa.Table.from_pandas doesn't use schema ordering > > > Key: ARROW-3766 > URL: https://issues.apache.org/jira/browse/ARROW-3766 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Christian Thiel >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Pyarrow is sensitive to the order of the columns upon load of partitioned > Files. > With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we > can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} > object does use the ordering of pandas columns rather than the schema > columns. Furthermore it is possible to have columns in the schema but not in > the DataFrame (and hence in the resulting pa.Table). > This behaviour requires a lot of fiddling with the pandas Frame in the first > place if we like to write compatible partitioned files. Hence I argue that > for {{pa.Table.from_pandas}}, and any other comparable function, the schema > should be the principal source for the Table structure and not the columns > and the ordering in the pandas DataFrame. If I specify a schema I simply > expect that the resulting Table actually has this schema. > Here is a little example. 
> If you remove the reordering of df2 everything works fine:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
>
> df1 = df[df.partition_column == 0]
> df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
> table1 = pa.Table.from_pandas(df1, schema=my_schema)
> table2 = pa.Table.from_pandas(df2, schema=my_schema)
> pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
> pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
> pd.read_parquet(PATH_PYARROW_MANUAL)
> {code}
> If
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683886#comment-16683886 ] Wes McKinney commented on ARROW-3738: - Those sound like the right ones. date64 does not support granularity beyond the resolution of a day. The values are supposed to be a multiple of 86400000 (the number of milliseconds in a day); some systems use this millisecond-based representation of calendar dates. A timestamp, by contrast, represents an intraday point in time. > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Reporter: Wes McKinney > Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
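The distinction can be checked with plain arithmetic: a date64 value is the number of milliseconds since the UNIX epoch at midnight UTC of a calendar date, so a valid value is always evenly divisible by 86400000. A small standard-library sketch (not using Arrow itself):

```python
from datetime import date, datetime, timezone

MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000

def date64_value(d):
    """Milliseconds since the UNIX epoch for midnight UTC of d,
    i.e. the millisecond-based representation date64 stores."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)

v = date64_value(date(2018, 11, 12))
assert v % MS_PER_DAY == 0  # a date64 value is a whole number of days
```

A timestamp value, in contrast, carries the intraday remainder and so is generally not divisible by MS_PER_DAY.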
[jira] [Commented] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683875#comment-16683875 ] Wes McKinney commented on ARROW-3762: - No -- there have been lots of cases where people have written on bug reports (not necessarily in Apache Arrow) things like "This is really causing a problem for me, when will it be fixed?" which you did not do, so thank you =) One time the comment was "This is a deal breaker for me"
> [C++] Arrow table reads error when overflowing capacity of BinaryArray
> --
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Chris Ellison
> Priority: Major
> Fix For: 0.12.0
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError
> due to it not creating chunked arrays. Reading each row group individually
> and then concatenating the tables works, however.
>
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
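The workaround in the report succeeds because each row group stays under the 32-bit offset limit of a single BinaryArray. The splitting idea can be sketched without Arrow at all: group values so that no chunk's total byte size exceeds the cap. The cap below mirrors the 2147483646 from the error message; `chunk_by_bytes` is an illustrative helper, not a pyarrow API:

```python
INT32_MAX = 2**31 - 1  # BinaryArray offsets are 32-bit integers

def chunk_by_bytes(values, cap=INT32_MAX - 1):
    """Split a list of byte strings into chunks whose total data
    size each stays within cap bytes. A single value larger than
    cap still gets its own chunk (it cannot be split here)."""
    chunks, current, size = [], [], 0
    for v in values:
        if current and size + len(v) > cap:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks

# Demo with a tiny cap: five 4-byte values, 10-byte cap -> 2 per chunk
out = chunk_by_bytes([b"aaaa"] * 5, cap=10)
assert [len(c) for c in out] == [2, 2, 1]
```

With the real cap, two 1 GiB strings would land in separate chunks, which is essentially what a ChunkedArray-producing reader needs to do.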
[jira] [Commented] (ARROW-3767) [C++] Add cast for Null to any type
[ https://issues.apache.org/jira/browse/ARROW-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683866#comment-16683866 ] Wes McKinney commented on ARROW-3767: - Some memory buffers will have to be allocated to conform to the columnar format (for primitive types, string, lists, etc.) but that shouldn't be too bad to build. We should try to use as much common utility code for this as possible > [C++] Add cast for Null to any type > --- > > Key: ARROW-3767 > URL: https://issues.apache.org/jira/browse/ARROW-3767 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Fix For: 0.13.0 > > > Casting a column from NullType to any other type is possible as the resulting > array will also be all-null but simply with a different type annotation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3767) [C++] Add cast for Null to any type
Uwe L. Korn created ARROW-3767: -- Summary: [C++] Add cast for Null to any type Key: ARROW-3767 URL: https://issues.apache.org/jira/browse/ARROW-3767 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe L. Korn Fix For: 0.13.0 Casting a column from NullType to any other type is possible as the resulting array will also be all-null but simply with a different type annotation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
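For the fixed-width case, the buffers mentioned above can be sketched directly: an all-null array needs a validity bitmap with every bit clear plus an allocated (but never read) data buffer. A standard-library illustration of the columnar layout, not Arrow's actual cast kernel:

```python
def all_null_buffers(length, value_width):
    """Buffers for an all-null fixed-width array of the given length:
    a validity bitmap of all zero bits plus a zeroed data buffer."""
    validity = bytes((length + 7) // 8)   # every bit 0 -> every slot null
    data = bytes(length * value_width)    # contents are never read
    return validity, data

# e.g. casting a 10-element null column to int64 (8-byte values):
validity, data = all_null_buffers(10, 8)
assert all(b == 0 for b in validity)
```

Variable-width types (string, list) additionally need an offsets buffer of zeros, but the principle is the same: only the type annotation changes, never any real values.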
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683527#comment-16683527 ] Antoine Pitrou commented on ARROW-3738: --- What formats exactly should we allow? I'm leaning towards "{{YYYY-MM-DD}}" and "{{YYYY-MM-DD[ T]hh:mm:ss[Z]}}". Also, what is the difference between the Arrow "date64" and "timestamp" types? > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Reporter: Wes McKinney > Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
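For reference, the two suggested patterns can be prototyped in a few lines of Python. This is a strptime-based sketch; the real C++ parser would be hand-rolled for speed, and the accepted variants below are only the ones proposed in this comment:

```python
from datetime import datetime

# Candidate patterns: "YYYY-MM-DD" and "YYYY-MM-DD[ T]hh:mm:ss[Z]"
_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d %H:%M:%SZ",
)

def parse_iso8601_like(s):
    """Return a datetime for an ISO8601-like string, or None."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    return None

assert parse_iso8601_like("2018-11-12T10:30:00Z") is not None
assert parse_iso8601_like("12/11/2018") is None
```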
[jira] [Updated] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Thiel updated ARROW-3766: --- Description: Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame. If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. 
If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
If

was:
Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame.
If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
> pa.Table.from_pandas doesn't use schema ordering
>
>
> Key: ARROW-3766
> URL: https://issues.apache.org/jira/browse/ARROW-3766
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Christian Thiel
> Priority: Major
>
[jira] [Created] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
Christian Thiel created ARROW-3766: -- Summary: pa.Table.from_pandas doesn't use schema ordering Key: ARROW-3766 URL: https://issues.apache.org/jira/browse/ARROW-3766 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Christian Thiel Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame. If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. 
If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
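Until {{from_pandas}} honors the schema, the practical workaround is to align the frame with the schema by hand: reorder the columns to the schema's order and add missing ones as all-null. A pure-Python sketch of that alignment (operating on a plain dict of columns instead of a real DataFrame so it stays self-contained; `align_to_schema` is a hypothetical helper, not a pyarrow function):

```python
def align_to_schema(columns, schema_names):
    """Return columns reordered to schema_names; columns missing
    from the input come back as all-null (None-filled) lists."""
    n = max((len(v) for v in columns.values()), default=0)
    return {name: columns.get(name, [None] * n) for name in schema_names}

# Mimics df2 above: columns present in the "wrong" order, plus a
# schema column ('new_column') absent from the frame entirely.
df_like = {"strings": ["a", "b"], "partition_column": [1, 1], "arrays": [[3], [4]]}
schema_names = ["partition_column", "arrays", "strings", "new_column"]
aligned = align_to_schema(df_like, schema_names)
assert list(aligned) == schema_names
assert aligned["new_column"] == [None, None]
```

The argument in the report is that {{pa.Table.from_pandas}} should perform exactly this alignment itself whenever a schema is supplied.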