[jira] [Created] (ARROW-3776) [Rust] Mark methods that do not perform bounds checking as unsafe
Paddy Horan created ARROW-3776: -- Summary: [Rust] Mark methods that do not perform bounds checking as unsafe Key: ARROW-3776 URL: https://issues.apache.org/jira/browse/ARROW-3776 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Paddy Horan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet
[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3238. - Resolution: Not A Problem I don't believe there is anything we can fix here > [Python] Can't read pyarrow string columns in fastparquet > - > > Key: ARROW-3238 > URL: https://issues.apache.org/jira/browse/ARROW-3238 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Theo Walker >Priority: Major > Labels: parquet > > Writing really long strings from pyarrow causes exception in fastparquet read. > {code:java} > Traceback (most recent call last): > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in > read_fastparquet() > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in > read_fastparquet > dff = pf.to_pandas(['A']) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 426, in to_pandas > index=index, assign=parts) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 258, in read_row_group > scheme=self.file_scheme) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 344, in read_row_group > cats, selfmade, assign=assign) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 321, in read_row_group_arrays > catdef=out.get(name+'-catdef', None)) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 235, in read_col > skip_nulls, selfmade=selfmade) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 99, in read_data_page > raw_bytes = _read_page(f, header, metadata) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 31, in _read_page > page_header.uncompressed_page_size) > AssertionError: found 175532 raw bytes (expected 200026){code} > If written with compression, it reports compression errors instead: 
> {code:java} > SNAPPY: snappy.UncompressError: Error while decompressing: invalid input > GZIP: zlib.error: Error -3 while decompressing data: incorrect header > check{code} > > > Minimal code to reproduce: > {code:java} > import os > import pandas as pd > import pyarrow > import pyarrow.parquet as arrow_pq > from fastparquet import ParquetFile > # data to generate > ROW_LENGTH = 4 # decreasing below 32750ish eliminates exception > N_ROWS = 10 > # file write params > ROW_GROUP_SIZE = 5 # Lower numbers eliminate exception, but strange data is > read (e.g. Nones) > FILENAME = 'test.parquet' > def write_arrow(): > df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]}) > if os.path.isfile(FILENAME): > os.remove(FILENAME) > arrow_table = pyarrow.Table.from_pandas(df) > arrow_pq.write_table(arrow_table, > FILENAME, > use_dictionary=False, > compression='NONE', > row_group_size=ROW_GROUP_SIZE) > def read_arrow(): > print "arrow:" > table2 = arrow_pq.read_table(FILENAME) > print table2.to_pandas().head() > def read_fastparquet(): > print "fastparquet:" > pf = ParquetFile(FILENAME) > dff = pf.to_pandas(['A']) > print dff.head() > write_arrow() > read_arrow() > read_fastparquet() > {code} > Versions: > {code:java} > fastparquet==0.1.6 > pyarrow==0.10.0 > pandas==0.22.0 > sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, > 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code} > Also opened issue here: https://github.com/dask/fastparquet/issues/375 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contiguous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contiguous arrays (was: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > contiguous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3766) [Python] pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3766: Summary: [Python] pa.Table.from_pandas doesn't use schema ordering (was: pa.Table.from_pandas doesn't use schema ordering) > [Python] pa.Table.from_pandas doesn't use schema ordering > - > > Key: ARROW-3766 > URL: https://issues.apache.org/jira/browse/ARROW-3766 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Christian Thiel >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Pyarrow is sensitive to the order of the columns upon load of partitioned > Files. > With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we > can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} > object does use the ordering of pandas columns rather than the schema > columns. Furthermore it is possible to have columns in the schema but not in > the DataFrame (and hence in the resulting pa.Table). > This behaviour requires a lot of fiddling with the pandas Frame in the first > place if we like to write compatible partitioned files. Hence I argue that > for {{pa.Table.from_pandas}}, and any other comparable function, the schema > should be the principal source for the Table structure and not the columns > and the ordering in the pandas DataFrame. If I specify a schema I simply > expect that the resulting Table actually has this schema. > Here is a little example. 
If you remove the reordering of df2 everything > works fine: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > import os > import numpy as np > import shutil > PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/' > if os.path.exists(PATH_PYARROW_MANUAL): > shutil.rmtree(PATH_PYARROW_MANUAL) > os.mkdir(PATH_PYARROW_MANUAL) > arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan]) > strings = np.array([np.nan, np.nan, 'a', 'b']) > df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column']) > df.index.name='DPRD_ID' > df['arrays'] = pd.Series(arrays) > df['strings'] = pd.Series(strings) > my_schema = pa.schema([('DPRD_ID', pa.int64()), >('partition_column', pa.int32()), >('arrays', pa.list_(pa.int32())), >('strings', pa.string()), >('new_column', pa.string())]) > df1 = df[df.partition_column==0] > df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']] > table1 = pa.Table.from_pandas(df1, schema=my_schema) > table2 = pa.Table.from_pandas(df2, schema=my_schema) > pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa')) > pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa')) > pd.read_parquet(PATH_PYARROW_MANUAL) > {code} > If -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1186 to ARROW-3775: -- Fix Version/s: (was: cpp-1.5.0) 0.12.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3775 (was: PARQUET-1186) Project: Apache Arrow (was: Parquet) > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3775: Labels: parquet (was: ) > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3771) [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to specify chunksize
[ https://issues.apache.org/jira/browse/ARROW-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3771: Labels: parquet (was: ) > [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to > specify chunksize > > > Key: ARROW-3771 > URL: https://issues.apache.org/jira/browse/ARROW-3771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Minor > Labels: parquet > > see [https://github.com/apache/parquet-cpp/pull/445] comments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3775) [C++] Handling Arrow reads that overflow a BinaryArray capacity
[ https://issues.apache.org/jira/browse/ARROW-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684428#comment-16684428 ] Wes McKinney commented on ARROW-3775: - Moved to Arrow. I think this might be a duplicate issue > [C++] Handling Arrow reads that overflow a BinaryArray capacity > --- > > Key: ARROW-3775 > URL: https://issues.apache.org/jira/browse/ARROW-3775 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See comment thread in > https://stackoverflow.com/questions/48115087/converting-parquetfile-to-pandas-dataframe-with-a-column-with-a-set-of-string-in > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3771) [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to specify chunksize
[ https://issues.apache.org/jira/browse/ARROW-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1257 to ARROW-3771: -- Component/s: (was: parquet-cpp) C++ External issue ID: (was: ARROW-2360) Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3771 (was: PARQUET-1257) Project: Apache Arrow (was: Parquet) > [C++] GetRecordBatchReader in parquet/arrow/reader.h should be able to > specify chunksize > > > Key: ARROW-3771 > URL: https://issues.apache.org/jira/browse/ARROW-3771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Minor > Labels: parquet > > see [https://github.com/apache/parquet-cpp/pull/445] comments -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous arrays (was: [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into contigous > arrays > > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3774: Labels: parquet (was: ) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > continuous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3774) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays
[ https://issues.apache.org/jira/browse/ARROW-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1393 to ARROW-3774: -- Fix Version/s: (was: cpp-1.6.0) Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Issue Type: Improvement (was: New Feature) Key: ARROW-3774 (was: PARQUET-1393) Project: Apache Arrow (was: Parquet) > [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into > continuous arrays > - > > Key: ARROW-3774 > URL: https://issues.apache.org/jira/browse/ARROW-3774 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > > Instead of creating a chunk per RowGroup, we should read at least for > primitive type into a single, pre-allocated Array. This needs some new > functionality in the Record reader classes and thus should be done after > https://github.com/apache/parquet-cpp/pull/462 is merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3773) [C++] Remove duplicated AssertArraysEqual code in parquet/arrow/arrow-reader-writer-test.cc
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3773: Summary: [C++] Remove duplicated AssertArraysEqual code in parquet/arrow/arrow-reader-writer-test.cc (was: [C++] Fix AssertArraysEqual call) > [C++] Remove duplicated AssertArraysEqual code in > parquet/arrow/arrow-reader-writer-test.cc > --- > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3773) [C++] Fix AssertArraysEqual call
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3773: Labels: parquet (was: ) > [C++] Fix AssertArraysEqual call > > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3773) [C++] Fix AssertArraysEqual call
[ https://issues.apache.org/jira/browse/ARROW-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1127 to ARROW-3773: -- Fix Version/s: (was: cpp-1.5.0) 0.12.0 Affects Version/s: (was: cpp-1.2.0) Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Issue Type: Improvement (was: Bug) Key: ARROW-3773 (was: PARQUET-1127) Project: Apache Arrow (was: Parquet) > [C++] Fix AssertArraysEqual call > > > Key: ARROW-3773 > URL: https://issues.apache.org/jira/browse/ARROW-3773 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684422#comment-16684422 ] Wes McKinney commented on ARROW-3772: - Moved issue to Arrow issue tracker > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1324 to ARROW-3772: -- Fix Version/s: (was: cpp-1.6.0) 0.13.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3772 (was: PARQUET-1324) Project: Apache Arrow (was: Parquet) > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3772: Labels: parquet (was: ) > [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow > DictionaryArray > - > > Key: ARROW-3772 > URL: https://issues.apache.org/jira/browse/ARROW-3772 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Stav Nir >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > Dictionary data is very common in parquet, in the current implementation > parquet-cpp decodes dictionary encoded data always before creating a plain > arrow array. This process is wasteful since we could use arrow's > DictionaryArray directly and achieve several benefits: > # Smaller memory footprint - both in the decoding process and in the > resulting arrow table - especially when the dict values are large > # Better decoding performance - mostly as a result of the first bullet - > less memory fetches and less allocations. > I think those benefits could achieve significant improvements in runtime. > My direction for the implementation is to read the indices (through the > DictionaryDecoder, after the RLE decoding) and values separately into 2 > arrays and create a DictionaryArray using them. > There are some questions to discuss: > # Should this be the default behavior for dictionary encoded data > # Should it be controlled with a parameter in the API > # What should be the policy in case some of the chunks are dictionary > encoded and some are not. > I started implementing this but would like to hear your opinions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3770: Component/s: C++ > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1362 to ARROW-3770: -- Fix Version/s: (was: cpp-1.6.0) Component/s: (was: parquet-cpp) Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3770 (was: PARQUET-1362) Project: Apache Arrow (was: Parquet) > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3770) [C++] Validate or add option to validate arrow::Table schema in parquet::arrow::FileWriter::WriteTable
[ https://issues.apache.org/jira/browse/ARROW-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3770: Labels: parquet (was: ) > [C++] Validate or add option to validate arrow::Table schema in > parquet::arrow::FileWriter::WriteTable > -- > > Key: ARROW-3770 > URL: https://issues.apache.org/jira/browse/ARROW-3770 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Failing to validate will cause a segfault when the passed table does not > match the schema used to instantiate the writer. See ARROW-2926 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684401#comment-16684401 ] Wes McKinney commented on ARROW-3769: - Moved this here from the Parquet JIRA > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Moved] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney moved PARQUET-1423 to ARROW-3769: -- Fix Version/s: (was: cpp-1.6.0) 0.13.0 Component/s: (was: parquet-cpp) C++ Workflow: jira (was: patch-available, re-open possible) Key: ARROW-3769 (was: PARQUET-1423) Project: Apache Arrow (was: Parquet) > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3769) [C++] Support reading non-dictionary encoded binary Parquet columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3769: Labels: parquet (was: ) > [C++] Support reading non-dictionary encoded binary Parquet columns directly > as DictionaryArray > --- > > Key: ARROW-3769 > URL: https://issues.apache.org/jira/browse/ARROW-3769 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > If the goal is to hash this data anyway into a categorical-type array, then > it would be better to offer the option to "push down" the hashing into the > Parquet read hot path rather than first fully materializing a dense vector of > {{ByteArray}} values, which could use a lot of memory after decompression -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684214#comment-16684214 ] Pindikura Ravindra commented on ARROW-3738: --- I'm fine with moving date.h to arrow/util, [~wesmckinn] > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684184#comment-16684184 ] Wes McKinney commented on ARROW-3738: - It looks like this has already happened in https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/date.h. I suggest we move {{date.h}} to {{arrow/util}} [~pravindra] sound ok? > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684181#comment-16684181 ] Antoine Pitrou commented on ARROW-3738: --- To keep things simple, I suggest we start using a date library. The following looks good (and is actually the basis for the future C++20 API): https://github.com/HowardHinnant/date We could simply vendor the {{date.h}} file. > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3768) [Python] set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3768: Summary: [Python] set classpath to hdfs not hadoop executable (was: set classpath to hdfs not hadoop executable) > [Python] set classpath to hdfs not hadoop executable > > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3768) [Python] set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684003#comment-16684003 ] Wes McKinney commented on ARROW-3768: - The issue will be resolved/closed once a patch is merged > [Python] set classpath to hdfs not hadoop executable > > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (ARROW-3768) set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-3768: - > set classpath to hdfs not hadoop executable > --- > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3768) set classpath to hdfs not hadoop executable
Andrew Harris created ARROW-3768: Summary: set classpath to hdfs not hadoop executable Key: ARROW-3768 URL: https://issues.apache.org/jira/browse/ARROW-3768 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Andrew Harris The documentation for connecting to hdfs from pyarrow shows using the `hdfs` executable for setting the CLASSPATH. However in `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My understanding is that we will want `_maybe_set_hadoop_classpath` to set CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
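The change under discussion can be sketched in Python. This mirrors the documented approach of asking the `hdfs` executable for the classpath; `hdfs_classpath` and its None fallback are illustrative, not pyarrow's actual `_maybe_set_hadoop_classpath`:

```python
import os
import shutil
import subprocess

def hdfs_classpath(executable="hdfs"):
    """Query the given executable for the Hadoop CLASSPATH.

    Returns the classpath string, or None when the executable is
    not on PATH (hypothetical fallback, for illustration only).
    """
    if shutil.which(executable) is None:
        return None
    out = subprocess.run([executable, "classpath", "--glob"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# As documented, prefer the `hdfs` executable over `hadoop`:
cp = hdfs_classpath("hdfs")
if cp is not None:
    os.environ["CLASSPATH"] = cp
```

On a machine without an HDFS install the helper simply returns None instead of raising, which is one reasonable shape for a "maybe set" function.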
[jira] [Updated] (ARROW-3439) [R] R language bindings for Feather format
[ https://issues.apache.org/jira/browse/ARROW-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3439: -- Labels: pull-request-available (was: ) > [R] R language bindings for Feather format > -- > > Key: ARROW-3439 > URL: https://issues.apache.org/jira/browse/ARROW-3439 > Project: Apache Arrow > Issue Type: New Feature > Components: R > Reporter: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > This will enable work on a "Feather v2" to commence so that the codebase > in github.com/wesm/feather can finally be deprecated -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3768) set classpath to hdfs not hadoop executable
[ https://issues.apache.org/jira/browse/ARROW-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Harris closed ARROW-3768. Resolution: Fixed > set classpath to hdfs not hadoop executable > --- > > Key: ARROW-3768 > URL: https://issues.apache.org/jira/browse/ARROW-3768 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Andrew Harris >Priority: Major > > The documentation for connecting to hdfs from pyarrow shows using the `hdfs` > executable for setting the CLASSPATH. However in > `_maybe_set_hadoop_classpath` the `hadoop` executable is being set. My > understanding is that we will want `_maybe_set_hadoop_classpath` to set > CLASSPATH to the `hdfs` executable as documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3766: Fix Version/s: 0.12.0 > pa.Table.from_pandas doesn't use schema ordering > > > Key: ARROW-3766 > URL: https://issues.apache.org/jira/browse/ARROW-3766 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Christian Thiel >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Pyarrow is sensitive to the order of the columns upon load of partitioned > Files. > With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we > can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} > object does use the ordering of pandas columns rather than the schema > columns. Furthermore it is possible to have columns in the schema but not in > the DataFrame (and hence in the resulting pa.Table). > This behaviour requires a lot of fiddling with the pandas Frame in the first > place if we like to write compatible partitioned files. Hence I argue that > for {{pa.Table.from_pandas}}, and any other comparable function, the schema > should be the principal source for the Table structure and not the columns > and the ordering in the pandas DataFrame. If I specify a schema I simply > expect that the resulting Table actually has this schema. > Here is a little example. 
> If you remove the reordering of df2 everything works fine:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import os
> import numpy as np
> import shutil
>
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
>
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> strings = np.array([np.nan, np.nan, 'a', 'b'])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df.index.name = 'DPRD_ID'
> df['arrays'] = pd.Series(arrays)
> df['strings'] = pd.Series(strings)
>
> my_schema = pa.schema([('DPRD_ID', pa.int64()),
>                        ('partition_column', pa.int32()),
>                        ('arrays', pa.list_(pa.int32())),
>                        ('strings', pa.string()),
>                        ('new_column', pa.string())])
>
> df1 = df[df.partition_column == 0]
> df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
> table1 = pa.Table.from_pandas(df1, schema=my_schema)
> table2 = pa.Table.from_pandas(df2, schema=my_schema)
> pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
> pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
> pd.read_parquet(PATH_PYARROW_MANUAL)
> {code}
> If
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683886#comment-16683886 ] Wes McKinney commented on ARROW-3738: - Those sound like the right ones. date64 does not support granularity beyond the resolution of a day. The values are supposed to be a multiple of 86400000 (the number of milliseconds in a day); some systems use this millisecond-based representation of calendar dates. A timestamp, by contrast, represents an intraday point in time. > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Reporter: Wes McKinney > Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
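The distinction can be checked with plain arithmetic: a date64 value is the number of milliseconds since the UNIX epoch at midnight UTC of a calendar date, so a valid value is always evenly divisible by 86400000. A small standard-library sketch (not using Arrow itself):

```python
from datetime import date, datetime, timezone

MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000

def date64_value(d):
    """Milliseconds since the UNIX epoch for midnight UTC of d,
    i.e. the millisecond-based representation date64 stores."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight.timestamp() * 1000)

v = date64_value(date(2018, 11, 12))
assert v % MS_PER_DAY == 0  # a date64 value is a whole number of days
```

A timestamp value, in contrast, carries the intraday remainder and so is generally not divisible by MS_PER_DAY.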
[jira] [Commented] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683875#comment-16683875 ] Wes McKinney commented on ARROW-3762: - No -- there have been lots of cases where people have written on bug reports (not necessarily in Apache Arrow) things like "This is really causing a problem for me, when will it be fixed?" which you did not do, so thank you =) One time the comment was "This is a deal breaker for me"
> [C++] Arrow table reads error when overflowing capacity of BinaryArray
> --
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Chris Ellison
> Priority: Major
> Fix For: 0.12.0
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError
> due to it not creating chunked arrays. Reading each row group individually
> and then concatenating the tables works, however.
>
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
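The workaround in the report succeeds because each row group stays under the 32-bit offset limit of a single BinaryArray. The splitting idea can be sketched without Arrow at all: group values so that no chunk's total byte size exceeds the cap. The cap below mirrors the 2147483646 from the error message; `chunk_by_bytes` is an illustrative helper, not a pyarrow API:

```python
INT32_MAX = 2**31 - 1  # BinaryArray offsets are 32-bit integers

def chunk_by_bytes(values, cap=INT32_MAX - 1):
    """Split a list of byte strings into chunks whose total data
    size each stays within cap bytes. A single value larger than
    cap still gets its own chunk (it cannot be split here)."""
    chunks, current, size = [], [], 0
    for v in values:
        if current and size + len(v) > cap:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks

# Demo with a tiny cap: five 4-byte values, 10-byte cap -> 2 per chunk
out = chunk_by_bytes([b"aaaa"] * 5, cap=10)
assert [len(c) for c in out] == [2, 2, 1]
```

With the real cap, two 1 GiB strings would land in separate chunks, which is essentially what a ChunkedArray-producing reader needs to do.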
[jira] [Commented] (ARROW-3767) [C++] Add cast for Null to any type
[ https://issues.apache.org/jira/browse/ARROW-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683866#comment-16683866 ] Wes McKinney commented on ARROW-3767: - Some memory buffers will have to be allocated to conform to the columnar format (for primitive types, string, lists, etc.) but that shouldn't be too bad to build. We should try to use as much common utility code for this as possible > [C++] Add cast for Null to any type > --- > > Key: ARROW-3767 > URL: https://issues.apache.org/jira/browse/ARROW-3767 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Fix For: 0.13.0 > > > Casting a column from NullType to any other type is possible as the resulting > array will also be all-null but simply with a different type annotation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3767) [C++] Add cast for Null to any type
Uwe L. Korn created ARROW-3767: -- Summary: [C++] Add cast for Null to any type Key: ARROW-3767 URL: https://issues.apache.org/jira/browse/ARROW-3767 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe L. Korn Fix For: 0.13.0 Casting a column from NullType to any other type is possible as the resulting array will also be all-null but simply with a different type annotation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
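For the fixed-width case, the buffers mentioned above can be sketched directly: an all-null array needs a validity bitmap with every bit clear plus an allocated (but never read) data buffer. A standard-library illustration of the columnar layout, not Arrow's actual cast kernel:

```python
def all_null_buffers(length, value_width):
    """Buffers for an all-null fixed-width array of the given length:
    a validity bitmap of all zero bits plus a zeroed data buffer."""
    validity = bytes((length + 7) // 8)   # every bit 0 -> every slot null
    data = bytes(length * value_width)    # contents are never read
    return validity, data

# e.g. casting a 10-element null column to int64 (8-byte values):
validity, data = all_null_buffers(10, 8)
assert all(b == 0 for b in validity)
```

Variable-width types (string, list) additionally need an offsets buffer of zeros, but the principle is the same: only the type annotation changes, never any real values.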
[jira] [Commented] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683527#comment-16683527 ] Antoine Pitrou commented on ARROW-3738: --- What formats exactly should we allow? I'm leaning towards "{{YYYY-MM-DD}}" and "{{YYYY-MM-DD[ T]hh:mm:ss[Z]}}". Also, what is the difference between the Arrow "date64" and "timestamp" types? > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Reporter: Wes McKinney > Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
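For reference, the two suggested patterns can be prototyped in a few lines of Python. This is a strptime-based sketch; the real C++ parser would be hand-rolled for speed, and the accepted variants below are only the ones proposed in this comment:

```python
from datetime import datetime

# Candidate patterns: "YYYY-MM-DD" and "YYYY-MM-DD[ T]hh:mm:ss[Z]"
_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d %H:%M:%SZ",
)

def parse_iso8601_like(s):
    """Return a datetime for an ISO8601-like string, or None."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    return None

assert parse_iso8601_like("2018-11-12T10:30:00Z") is not None
assert parse_iso8601_like("12/11/2018") is None
```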
[jira] [Updated] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
[ https://issues.apache.org/jira/browse/ARROW-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Thiel updated ARROW-3766: --- Description: Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame. If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. 
If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
If

was:
Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame.
If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
> pa.Table.from_pandas doesn't use schema ordering
>
>
> Key: ARROW-3766
> URL: https://issues.apache.org/jira/browse/ARROW-3766
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Christian Thiel
> Priority: Major
>
[jira] [Created] (ARROW-3766) pa.Table.from_pandas doesn't use schema ordering
Christian Thiel created ARROW-3766: -- Summary: pa.Table.from_pandas doesn't use schema ordering Key: ARROW-3766 URL: https://issues.apache.org/jira/browse/ARROW-3766 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Christian Thiel Pyarrow is sensitive to the order of the columns upon load of partitioned Files. With the function {{pa.Table.from_pandas(dataframe, schema=my_schema)}} we can apply a schema to a dataframe. I noticed that the returned {{pa.Table}} object does use the ordering of pandas columns rather than the schema columns. Furthermore it is possible to have columns in the schema but not in the DataFrame (and hence in the resulting pa.Table). This behaviour requires a lot of fiddling with the pandas Frame in the first place if we like to write compatible partitioned files. Hence I argue that for {{pa.Table.from_pandas}}, and any other comparable function, the schema should be the principal source for the Table structure and not the columns and the ordering in the pandas DataFrame. If I specify a schema I simply expect that the resulting Table actually has this schema. Here is a little example. 
If you remove the reordering of df2 everything works fine:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column == 0]
df2 = df[df.partition_column == 1][['strings', 'partition_column', 'arrays']]
table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)
pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))
pd.read_parquet(PATH_PYARROW_MANUAL)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
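Until {{from_pandas}} honors the schema, the practical workaround is to align the frame with the schema by hand: reorder the columns to the schema's order and add missing ones as all-null. A pure-Python sketch of that alignment (operating on a plain dict of columns instead of a real DataFrame so it stays self-contained; `align_to_schema` is a hypothetical helper, not a pyarrow function):

```python
def align_to_schema(columns, schema_names):
    """Return columns reordered to schema_names; columns missing
    from the input come back as all-null (None-filled) lists."""
    n = max((len(v) for v in columns.values()), default=0)
    return {name: columns.get(name, [None] * n) for name in schema_names}

# Mimics df2 above: columns present in the "wrong" order, plus a
# schema column ('new_column') absent from the frame entirely.
df_like = {"strings": ["a", "b"], "partition_column": [1, 1], "arrays": [[3], [4]]}
schema_names = ["partition_column", "arrays", "strings", "new_column"]
aligned = align_to_schema(df_like, schema_names)
assert list(aligned) == schema_names
assert aligned["new_column"] == [None, None]
```

The argument in the report is that {{pa.Table.from_pandas}} should perform exactly this alignment itself whenever a schema is supplied.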