Issue with writing null values to complex type.

2019-04-29 Thread shyam narayan singh
Hi

I have encountered a regression when writing nulls to a complex type. I
recently moved from Parquet 1.8.x to 1.12.

Here is what I found out.

My dataset has 111k null values to be written to a complex type. With
1.8.x this would create a single page, but with 1.12 it creates 20 pages
(because of PARQUET-1414).

Writing nulls to complex types has been optimised so that the nulls are
cached (the null cache) and flushed on the next non-null value or on an
explicit flush/close. With 1.8, the explicit close flushed the null cache
and wrote a single page. But with 1.12, the page is written prematurely
after 20k values are encountered. Below is the metadata dump in both cases.

1.8 :

index._id TV=111396 RL=0 DL=2

page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not
defined] SZ:8 VC:111396

1.12 :

index._index TV=111396 RL=0 DL=2


page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this
column] SZ:4 VC:0
..
page 19:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this
column] SZ:8 VC:111396

All the pages in 1.12 except the last one have the same metadata. The
issue is that when the Parquet reader kicks in, it sees that the RLE is
bit-packed and reads 8 bytes, which goes beyond the stream because the
size is only 4 ("Reading past RLE/BitPacking stream").

For any page write, I think the null cache should be flushed.

For now, I have increased the row count limit to INT_MAX, which negates
everything done for PARQUET-1414. Are there any implications?
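
(Editorial sketch, not part of the original mail: PARQUET-1414 added a
per-page row count limit to parquet-mr, defaulting to 20,000 rows, which is
why the 111k cached nulls are now split across 20 pages. A minimal way to
raise the limit, assuming the ParquetProperties builder API that shipped
with that change:)

{code:java}
import org.apache.parquet.column.ParquetProperties;

public class RowCountLimitWorkaround {
  public static ParquetProperties relaxedPageLimit() {
    // Raise the PARQUET-1414 per-page row count limit (default 20,000) so a
    // long run of cached nulls is no longer split across many tiny pages.
    // Whether this is safe as more than a stopgap is exactly the question above.
    return ParquetProperties.builder()
        .withPageRowCountLimit(Integer.MAX_VALUE)
        .build();
  }
}
{code}

(The limit should also be reachable through a Hadoop/job configuration key,
but the exact key name depends on the version, so only the builder call is
shown here.)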

Please let me know the next steps on it.

Regards
Shyam


[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829769#comment-16829769
 ] 

Deepak Majeti commented on PARQUET-1405:


Filed https://issues.apache.org/jira/browse/ARROW-5241

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --
>
> Key: PARQUET-1405
> URL: https://issues.apache.org/jira/browse/PARQUET-1405
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>Reporter: Jeremy Heffner
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.6.0
>
> Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns 
> (utf8 strings).  In particular, we were generating WKT representations of 
> polygons that contained ~34 million characters when we ran into the issue. 
> The attached example generates a dataframe with one record and one column 
> containing a random string with 10^7 characters.
> Pandas (using the default pyarrow engine) successfully writes the file, but 
> fails upon reading the file:
> {code:java}
> ---
> ArrowIOError Traceback (most recent call last)
>  in ()
> > 1 df_read_in = pd.read_parquet('test.parquet')
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read_parquet(path, engine, columns, **kwargs)
> 286 
> 287 impl = get_engine(engine)
> --> 288 return impl.read(path, columns=columns, **kwargs)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in 
> read(self, path, columns, **kwargs)
> 129 kwargs['use_pandas_metadata'] = True
> 130 result = self.api.parquet.read_table(path, columns=columns,
> --> 131 **kwargs).to_pandas()
> 132 if should_close:
> 133 try:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read_table(source, columns, nthreads, metadata, use_pandas_metadata)
> 1044 fs = _get_fs_from_path(source)
> 1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046 use_pandas_metadata=use_pandas_metadata)
> 1047 
> 1048 pf = ParquetFile(source, metadata=metadata)
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in 
> read_parquet(self, path, columns, metadata, schema, nthreads, 
> use_pandas_metadata)
> 175 filesystem=self)
> 176 return dataset.read(columns=columns, nthreads=nthreads,
> --> 177 use_pandas_metadata=use_pandas_metadata)
> 178 
> 179 def open(self, path, mode='rb'):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 896 partitions=self.partitions,
> 897 open_file_func=open_file,
> --> 898 use_pandas_metadata=use_pandas_metadata)
> 899 tables.append(table)
> 900 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, partitions, open_file_func, file, 
> use_pandas_metadata)
> 459 table = reader.read_row_group(self.row_group, **options)
> 460 else:
> --> 461 table = reader.read(**options)
> 462 
> 463 if len(self.partition_keys) > 0:
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in 
> read(self, columns, nthreads, use_pandas_metadata)
> 150 columns, use_pandas_metadata=use_pandas_metadata)
> 151 return self.reader.read_all(column_indices=column_indices,
> --> 152 nthreads=nthreads)
> 153 
> 154 def scan_contents(self, columns=None, batch_size=65536):
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in 
> pyarrow._parquet.ParquetReader.read_all()
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829735#comment-16829735
 ] 

Wes McKinney commented on PARQUET-1405:
---

We can add an option to not write statistics. Can you open an ARROW JIRA
about this?



[jira] [Comment Edited] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829711#comment-16829711
 ] 

Deepak Majeti edited comment on PARQUET-1405 at 4/29/19 8:59 PM:
-

PARQUET-979 omits large statistics inside ColumnMetaData but missed omitting 
large statistics inside the DataPageHeader. I will fix this.
Disabling statistics when writing is a workaround, but I don't see any option 
to disable statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.


was (Author: mdeepak):
PARQUET-979 omits large statistics inside ColumnMetaData but missed omitting 
large statistics inside the DataPageHeader. I will fix this.
Disabling statistics is a workaround, but I don't see any option to disable 
statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.



[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829711#comment-16829711
 ] 

Deepak Majeti commented on PARQUET-1405:


PARQUET-979 omits large statistics inside ColumnMetaData but missed omitting 
large statistics inside the DataPageHeader. I will fix this.
Disabling statistics is a workaround, but I don't see any option to disable 
statistics in the Python API.
[~wesmckinn] or [~xhochy] must correct me here.
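
(Editorial sketch, not from the original comment: the fix described above
amounts to applying the same kind of size check that PARQUET-979 introduced
for ColumnMetaData statistics before statistics get embedded in a
DataPageHeader. The class name and threshold below are illustrative only,
not the actual parquet-cpp code.)

{code:java}
// Illustrative only; the real fix lives in the parquet-cpp page writer.
final class PageStatisticsFilter {

  // Hypothetical cap on the serialized min/max size; the real writers may
  // use a different threshold.
  private static final int MAX_STATS_SIZE = 4 * 1024;

  /** Decide whether min/max are small enough to be embedded in a page header. */
  static boolean fitsInPageHeader(byte[] min, byte[] max) {
    long size = (min == null ? 0 : min.length) + (max == null ? 0 : max.length);
    return size <= MAX_STATS_SIZE;
  }
}
{code}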



[jira] [Assigned] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1405:
--

Assignee: Deepak Majeti



Key signing (was: [VOTE] Release Apache Parquet 1.11.0 RC6)

2019-04-29 Thread Zoltan Ivanfi
Hi,

A video call sounds more secure to me than a photo, which can be easily
manipulated. We could spend 5 minutes on it in the next Parquet sync.
Alternatively, is there someone already in the web of trust who would
volunteer to do a private video call with us before or after the sync?

Thanks,

Zoltan


On Mon, Apr 29, 2019 at 7:52 PM Wes McKinney  wrote:
>
> On Mon, Apr 29, 2019 at 12:48 PM Zoltan Ivanfi  
> wrote:
> >
> > Hi,
> >
> > An excerpt from
> > https://www.apache.org/dev/release-signing#verifying-signature : "A
> > signature is valid, if gpg verifies the .asc as a good signature, and
> > doesn't complain about expired or revoked keys." Another excerpt from
> > https://www.apache.org/dev/release-signing#check-integrity that
> > reinforces that signing each other's keys is optional: "If you are
> > connected to the Apache web of trust then this also offers superior
> > security."
> >
> > That being said I support signing each other's keys. Of course, you
> > will still need one key somewhere along the signing chain that you
> > trust. I see that a few PMC members have signed keys, how should we
> > approach this task? The HOWTO suggests public conferences and key
> > signing parties, but I hope there is a way to do that remotely. Would
> > members who are already in the web of trust feel comfortable signing
> > our keys based the on the following?
> >
> > - Our keys have been committed to the central KEYS file using our
> > apache credentials.
> > - We could personally confirm this in the next Parquet sync.
> > - We could even read the key ID-s out loud if needed.
> >
>
> In person is best (if it is a person whose identity you are sure of),
> for people I know personally what I've done to sign their key remotely
> is have them write down the PGP fingerprint and show the paper to me
> in a photograph of themselves or in a video call. I don't know whether
> this is a good security practice but it seems better than doing things
> over e-mail =)
>
> - Wes
>
> > Br,
> >
> > Zoltan
> >
> >
> > On Mon, Apr 29, 2019 at 7:11 PM Zoltan Ivanfi  wrote:
> > >
> > > Hi Wes,
> > >
> > > Gabor's key is in the KEYS file available at 
> > > https://dist.apache.org/repos/dist/dev/parquet/KEYS Others may correct me 
> > > if I'm mistaken, but as far as I know, this is all that is required. I 
> > > mentioned this in the verification steps as well ("4. Verify the 
> > > signature by running `gpg --verify apache-parquet-1.11.0.tar.gz.asc`. It 
> > > should say "Good signature", the warning about the key not being trusted 
> > > can be ignored"). My signing key is also unsigned, because instead of 
> > > signing each other's keys we depend on the fact that only privileged 
> > > users can put their key into the central KEYS file.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Mon, Apr 29, 2019 at 6:46 PM Wes McKinney  wrote:
> > >>
> > >> -1
> > >>
> > >> Gabor's PGP key is unsigned.
> > >>
> > >> $ gpg --verify apache-parquet-1.11.0.tar.gz.asc
> > >> gpg: assuming signed data in 'apache-parquet-1.11.0.tar.gz'
> > >> gpg: Signature made Tue 19 Mar 2019 08:55:48 AM CDT
> > >> gpg:using RSA key 
> > >> 6FB82970311551C7CEF131F5021057DBF048F543
> > >> gpg: Good signature from "Gabor Szadovszky " [unknown]
> > >> gpg: WARNING: This key is not certified with a trusted signature!
> > >> gpg:  There is no indication that the signature belongs to the 
> > >> owner.
> > >> Primary key fingerprint: 6FB8 2970 3115 51C7 CEF1  31F5 0210 57DB F048 
> > >> F543
> > >>
> > >> On Tue, Apr 16, 2019 at 4:10 AM Gabor Szadovszky  
> > >> wrote:
> > >> >
> > >> > Based on our release process (
> > >> > http://parquet.apache.org/documentation/how-to-release/) and the 
> > >> > related
> > >> > scripts we use the final tag for an RC. So, the existence of this tag 
> > >> > does
> > >> > not mean 1.11.0 is released.
> > >> > However, I agree this is misleading and not a good practice to remove
> > >> > already committed tags and re-add them to another place (when a new RC
> > >> > comes out). I think, we should update our release process to use RC 
> > >> > tags
> > >> > and put the final tag only after it is officially released. But it is 
> > >> > the
> > >> > story of the next release...
> > >> >
> > >> >
> > >> > On Sat, Apr 13, 2019 at 8:00 PM 俊杰陈  wrote:
> > >> >
> > >> > > From the github release page, I see the 1.11.0 already released. Is 
> > >> > > it
> > >> > > still a rc version?
> > >> > > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.11.0
> > >> > >
> > >> > > On Fri, Apr 12, 2019 at 8:10 AM Ryan Blue 
> > >> > > wrote:
> > >> > >
> > >> > > > Personally, I haven't had enough time to devote to Parquet lately 
> > >> > > > and
> > >> > > that
> > >> > > > means I haven't validated that this release's new features are 
> > >> > > > okay to
> > >> > > > release. I'm hoping sometime in the next few weeks I'll be able to 
> > >> > > > vote
> > >> > > on
> > >> > > > this.
> > >> > > >
> 

Re: [VOTE] Release Apache Parquet 1.11.0 RC6

2019-04-29 Thread Wes McKinney
On Mon, Apr 29, 2019 at 12:48 PM Zoltan Ivanfi  
wrote:
>
> Hi,
>
> An excerpt from
> https://www.apache.org/dev/release-signing#verifying-signature : "A
> signature is valid, if gpg verifies the .asc as a good signature, and
> doesn't complain about expired or revoked keys." Another excerpt from
> https://www.apache.org/dev/release-signing#check-integrity that
> reinforces that signing each other's keys is optional: "If you are
> connected to the Apache web of trust then this also offers superior
> security."
>
> That being said I support signing each other's keys. Of course, you
> will still need one key somewhere along the signing chain that you
> trust. I see that a few PMC members have signed keys, how should we
> approach this task? The HOWTO suggests public conferences and key
> signing parties, but I hope there is a way to do that remotely. Would
> members who are already in the web of trust feel comfortable signing
> our keys based the on the following?
>
> - Our keys have been committed to the central KEYS file using our
> apache credentials.
> - We could personally confirm this in the next Parquet sync.
> - We could even read the key ID-s out loud if needed.
>

In person is best (if it is a person whose identity you are sure of),
for people I know personally what I've done to sign their key remotely
is have them write down the PGP fingerprint and show the paper to me
in a photograph of themselves or in a video call. I don't know whether
this is a good security practice but it seems better than doing things
over e-mail =)

- Wes

> Br,
>
> Zoltan
>
>
> On Mon, Apr 29, 2019 at 7:11 PM Zoltan Ivanfi  wrote:
> >
> > Hi Wes,
> >
> > Gabor's key is in the KEYS file available at 
> > https://dist.apache.org/repos/dist/dev/parquet/KEYS Others may correct me 
> > if I'm mistaken, but as far as I know, this is all that is required. I 
> > mentioned this in the verification steps as well ("4. Verify the signature 
> > by running `gpg --verify apache-parquet-1.11.0.tar.gz.asc`. It should say 
> > "Good signature", the warning about the key not being trusted can be 
> > ignored"). My signing key is also unsigned, because instead of signing each 
> > other's keys we depend on the fact that only privileged users can put their 
> > key into the central KEYS file.
> >
> > Br,
> >
> > Zoltan
> >
> > On Mon, Apr 29, 2019 at 6:46 PM Wes McKinney  wrote:
> >>
> >> -1
> >>
> >> Gabor's PGP key is unsigned.
> >>
> >> $ gpg --verify apache-parquet-1.11.0.tar.gz.asc
> >> gpg: assuming signed data in 'apache-parquet-1.11.0.tar.gz'
> >> gpg: Signature made Tue 19 Mar 2019 08:55:48 AM CDT
> >> gpg:using RSA key 6FB82970311551C7CEF131F5021057DBF048F543
> >> gpg: Good signature from "Gabor Szadovszky " [unknown]
> >> gpg: WARNING: This key is not certified with a trusted signature!
> >> gpg:  There is no indication that the signature belongs to the 
> >> owner.
> >> Primary key fingerprint: 6FB8 2970 3115 51C7 CEF1  31F5 0210 57DB F048 F543
> >>
> >> On Tue, Apr 16, 2019 at 4:10 AM Gabor Szadovszky  wrote:
> >> >
> >> > Based on our release process (
> >> > http://parquet.apache.org/documentation/how-to-release/) and the related
> >> > scripts we use the final tag for an RC. So, the existence of this tag 
> >> > does
> >> > not mean 1.11.0 is released.
> >> > However, I agree this is misleading and not a good practice to remove
> >> > already committed tags and re-add them to another place (when a new RC
> >> > comes out). I think, we should update our release process to use RC tags
> >> > and put the final tag only after it is officially released. But it is the
> >> > story of the next release...
> >> >
> >> >
> >> > On Sat, Apr 13, 2019 at 8:00 PM 俊杰陈  wrote:
> >> >
> >> > > From the github release page, I see the 1.11.0 already released. Is it
> >> > > still a rc version?
> >> > > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.11.0
> >> > >
> >> > > On Fri, Apr 12, 2019 at 8:10 AM Ryan Blue 
> >> > > wrote:
> >> > >
> >> > > > Personally, I haven't had enough time to devote to Parquet lately and
> >> > > that
> >> > > > means I haven't validated that this release's new features are okay 
> >> > > > to
> >> > > > release. I'm hoping sometime in the next few weeks I'll be able to 
> >> > > > vote
> >> > > on
> >> > > > this.
> >> > > >
> >> > > > On Thu, Apr 11, 2019 at 1:23 PM Andy Grove  
> >> > > > wrote:
> >> > > >
> >> > > > > I'm curious if there is any update on this vote? The thread seems
> >> > > eerily
> >> > > > > quiet.
> >> > > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > > On 4/3/19, 10:38 AM, "Andy Grove"  wrote:
> >> > > > >
> >> > > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
> >> > > > >
> >> > > > >
> >> > > > > I have been able to run mvn verify and have also tested this RC
> >> > > > > against our internal systems, with no issue.
> >> > > > >
> >> > > > > +1 (non-binding)
> >> > > > >
> >> > > > > I have raised the issue about Hadoop-lzo, but 

Re: [VOTE] Release Apache Parquet 1.11.0 RC6

2019-04-29 Thread Zoltan Ivanfi
Hi,

An excerpt from
https://www.apache.org/dev/release-signing#verifying-signature : "A
signature is valid, if gpg verifies the .asc as a good signature, and
doesn't complain about expired or revoked keys." Another excerpt from
https://www.apache.org/dev/release-signing#check-integrity that
reinforces that signing each other's keys is optional: "If you are
connected to the Apache web of trust then this also offers superior
security."

That being said, I support signing each other's keys. Of course, you
will still need one key somewhere along the signing chain that you
trust. I see that a few PMC members have signed keys; how should we
approach this task? The HOWTO suggests public conferences and key
signing parties, but I hope there is a way to do that remotely. Would
members who are already in the web of trust feel comfortable signing
our keys based on the following?

- Our keys have been committed to the central KEYS file using our
Apache credentials.
- We could personally confirm this in the next Parquet sync.
- We could even read the key IDs out loud if needed.

Br,

Zoltan


On Mon, Apr 29, 2019 at 7:11 PM Zoltan Ivanfi  wrote:
>
> Hi Wes,
>
> Gabor's key is in the KEYS file available at 
> https://dist.apache.org/repos/dist/dev/parquet/KEYS Others may correct me if 
> I'm mistaken, but as far as I know, this is all that is required. I mentioned 
> this in the verification steps as well ("4. Verify the signature by running 
> `gpg --verify apache-parquet-1.11.0.tar.gz.asc`. It should say "Good 
> signature", the warning about the key not being trusted can be ignored"). My 
> signing key is also unsigned, because instead of signing each other's keys we 
> depend on the fact that only privileged users can put their key into the 
> central KEYS file.
>
> Br,
>
> Zoltan
>
> On Mon, Apr 29, 2019 at 6:46 PM Wes McKinney  wrote:
>>
>> -1
>>
>> Gabor's PGP key is unsigned.
>>
>> $ gpg --verify apache-parquet-1.11.0.tar.gz.asc
>> gpg: assuming signed data in 'apache-parquet-1.11.0.tar.gz'
>> gpg: Signature made Tue 19 Mar 2019 08:55:48 AM CDT
>> gpg:using RSA key 6FB82970311551C7CEF131F5021057DBF048F543
>> gpg: Good signature from "Gabor Szadovszky " [unknown]
>> gpg: WARNING: This key is not certified with a trusted signature!
>> gpg:  There is no indication that the signature belongs to the owner.
>> Primary key fingerprint: 6FB8 2970 3115 51C7 CEF1  31F5 0210 57DB F048 F543
>>
>> On Tue, Apr 16, 2019 at 4:10 AM Gabor Szadovszky  wrote:
>> >
>> > Based on our release process (
>> > http://parquet.apache.org/documentation/how-to-release/) and the related
>> > scripts we use the final tag for an RC. So, the existence of this tag does
>> > not mean 1.11.0 is released.
>> > However, I agree this is misleading and not a good practice to remove
>> > already committed tags and re-add them to another place (when a new RC
>> > comes out). I think, we should update our release process to use RC tags
>> > and put the final tag only after it is officially released. But it is the
>> > story of the next release...
>> >
>> >
>> > On Sat, Apr 13, 2019 at 8:00 PM 俊杰陈  wrote:
>> >
>> > > From the github release page, I see the 1.11.0 already released. Is it
>> > > still a rc version?
>> > > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.11.0
>> > >
>> > > On Fri, Apr 12, 2019 at 8:10 AM Ryan Blue 
>> > > wrote:
>> > >
>> > > > Personally, I haven't had enough time to devote to Parquet lately and
>> > > that
>> > > > means I haven't validated that this release's new features are okay to
>> > > > release. I'm hoping sometime in the next few weeks I'll be able to vote
>> > > on
>> > > > this.
>> > > >
>> > > > On Thu, Apr 11, 2019 at 1:23 PM Andy Grove  wrote:
>> > > >
>> > > > > I'm curious if there is any update on this vote? The thread seems
>> > > eerily
>> > > > > quiet.
>> > > > >
>> > > > > Thanks.
>> > > > >
>> > > > > On 4/3/19, 10:38 AM, "Andy Grove"  wrote:
>> > > > >
>> > > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
>> > > > >
>> > > > >
>> > > > > I have been able to run mvn verify and have also tested this RC
>> > > > > against our internal systems, with no issue.
>> > > > >
>> > > > > +1 (non-binding)
>> > > > >
>> > > > > I have raised the issue about Hadoop-lzo, but that is present in
>> > > the
>> > > > > 1.10.1 release also.
>> > > > >
>> > > > > Andy.
>> > > > >
>> > > > >
>> > > > > On 3/20/19, 7:50 AM, "Zoltan Ivanfi" 
>> > > > wrote:
>> > > > >
>> > > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
>> > > > >
>> > > > >
>> > > > > +1 (binding)
>> > > > >
>> > > > > signature matches
>> > > > > git hash matches the git tag
>> > > > > source tarball matches the git tag
>> > > > > unit tests and integration tests pass
>> > > > >
>> > > > > On Tue, Mar 19, 2019 at 3:00 PM Gabor Szadovszky <
>> > > > ga...@apache.org>
>> > > > > wrote:
>> > > > >
>> > > > > > Dear Parquet Users and 

Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
Yeah, you are right. Looks like the right JIRA ticket.

On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher 
wrote:

> Would that be covered by PARQUET-458 (
> https://issues.apache.org/jira/browse/PARQUET-458)?
>
> On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney  wrote:
>
> > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> >
> > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher 
> > wrote:
> > >
> > > But the data page is decoded only after it is decompressed, so I
> > wouldn’t expect an unsupported data page to cause a decompression
> failure.
> > >
> > > (I am playing with adding V2 support to Parquet.Net.)
> > >
> > > Sent from my iPhone
> > >
> > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov 
> > wrote:
> > > >
> > > > If you are referring to the file in Apache/parquet-testing
> repository,
> > it
> > > > is a valid Parquet file with data encoded into data page v2.
> > > >
> > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > filepath”.
> > > >
> > > > I am not sure what kind of code you have written, but the error you
> > have
> > > > encountered could be related to the fact that parquet-cpp does not
> > support
> > > > decoding of data page v2.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Ivan
> > > >
> > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> c...@hagenlocher.org
> > >
> > > > wrote:
> > > >
> > > >> To the best of my ability to tell, there is invalid Snappy data in
> > the file
> > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read
> > it with
> > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > >>
> > > >> Thanks!
> > > >> -Curt
> > > >>
> >
>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
Not in V2. In V1 the whole page is compressed, but in V2 only the values
are, if I remember correctly. So we would have to extract the repetition
and definition level bytes and then decode the values.

You can check out the code in the parquet Rust module!

I am not sure about parquet-cpp; we can use that implementation as a
reference there.
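
(Editorial sketch, not part of the original mail: per the format spec, a v2
data page stores the repetition and definition level bytes first and never
compresses them; only the values section that follows is compressed when
is_compressed is set, and the two level lengths come from DataPageHeaderV2.
The helper below is illustrative and assumes the xerial snappy-java library.)

{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.xerial.snappy.Snappy;

public class DataPageV2Splitter {

  /**
   * Splits the raw bytes of a v2 data page into level bytes and values and
   * decompresses only the values, which is what a v2-aware reader has to do.
   * repLevelsLen and defLevelsLen come from the DataPageHeaderV2 fields
   * repetition_levels_byte_length and definition_levels_byte_length.
   */
  public static byte[][] split(byte[] page, int repLevelsLen, int defLevelsLen,
                               boolean valuesCompressed) throws IOException {
    int levelsLen = repLevelsLen + defLevelsLen;
    byte[] levels = Arrays.copyOfRange(page, 0, levelsLen);   // levels are never compressed in v2
    byte[] values = Arrays.copyOfRange(page, levelsLen, page.length);
    if (valuesCompressed) {
      values = Snappy.uncompress(values);                     // only the values block is Snappy data
    }
    return new byte[][] { levels, values };
  }
}
{code}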


On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher 
wrote:

> Would that be covered by PARQUET-458 (
> https://issues.apache.org/jira/browse/PARQUET-458)?
>
> On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney  wrote:
>
> > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> >
> > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher 
> > wrote:
> > >
> > > But the data page is decoded only after it is decompressed, so I
> > wouldn’t expect an unsupported data page to cause a decompression
> failure.
> > >
> > > (I am playing with adding V2 support to Parquet.Net.)
> > >
> > > Sent from my iPhone
> > >
> > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov 
> > wrote:
> > > >
> > > > If you are referring to the file in Apache/parquet-testing
> repository,
> > it
> > > > is a valid Parquet file with data encoded into data page v2.
> > > >
> > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > filepath”.
> > > >
> > > > I am not sure what kind of code you have written, but the error you
> > have
> > > > encountered could be related to the fact that parquet-cpp does not
> > support
> > > > decoding of data page v2.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Ivan
> > > >
> > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> c...@hagenlocher.org
> > >
> > > > wrote:
> > > >
> > > >> To the best of my ability to tell, there is invalid Snappy data in
> > the file
> > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read
> > it with
> > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > >>
> > > >> Thanks!
> > > >> -Curt
> > > >>
> >
>


Re: [VOTE] Release Apache Parquet 1.11.0 RC6

2019-04-29 Thread Zoltan Ivanfi
Hi Wes,

Gabor's key is in the KEYS file available at
https://dist.apache.org/repos/dist/dev/parquet/KEYS Others may correct me
if I'm mistaken, but as far as I know, this is all that is required. I
mentioned this in the verification steps as well ("4. Verify the signature
by running `gpg --verify apache-parquet-1.11.0.tar.gz.asc`. It should say
"Good signature", the warning about the key not being trusted can be
ignored"). My signing key is also unsigned, because instead of signing each
other's keys we depend on the fact that only privileged users can put their
key into the central KEYS file.

Br,

Zoltan

On Mon, Apr 29, 2019 at 6:46 PM Wes McKinney  wrote:

> -1
>
> Gabor's PGP key is unsigned.
>
> $ gpg --verify apache-parquet-1.11.0.tar.gz.asc
> gpg: assuming signed data in 'apache-parquet-1.11.0.tar.gz'
> gpg: Signature made Tue 19 Mar 2019 08:55:48 AM CDT
> gpg:using RSA key 6FB82970311551C7CEF131F5021057DBF048F543
> gpg: Good signature from "Gabor Szadovszky " [unknown]
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: 6FB8 2970 3115 51C7 CEF1  31F5 0210 57DB F048 F543
>
> On Tue, Apr 16, 2019 at 4:10 AM Gabor Szadovszky  wrote:
> >
> > Based on our release process (
> > http://parquet.apache.org/documentation/how-to-release/) and the related
> > scripts we use the final tag for an RC. So, the existence of this tag
> does
> > not mean 1.11.0 is released.
> > However, I agree this is misleading and not a good practice to remove
> > already committed tags and re-add them to another place (when a new RC
> > comes out). I think, we should update our release process to use RC tags
> > and put the final tag only after it is officially released. But it is the
> > story of the next release...
> >
> >
> > On Sat, Apr 13, 2019 at 8:00 PM 俊杰陈  wrote:
> >
> > > From the github release page, I see the 1.11.0 already released. Is it
> > > still a rc version?
> > >
> https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.11.0
> > >
> > > On Fri, Apr 12, 2019 at 8:10 AM Ryan Blue 
> > > wrote:
> > >
> > > > Personally, I haven't had enough time to devote to Parquet lately and
> > > that
> > > > means I haven't validated that this release's new features are okay
> to
> > > > release. I'm hoping sometime in the next few weeks I'll be able to
> vote
> > > on
> > > > this.
> > > >
> > > > On Thu, Apr 11, 2019 at 1:23 PM Andy Grove 
> wrote:
> > > >
> > > > > I'm curious if there is any update on this vote? The thread seems
> > > eerily
> > > > > quiet.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On 4/3/19, 10:38 AM, "Andy Grove"  wrote:
> > > > >
> > > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
> > > > >
> > > > >
> > > > > I have been able to run mvn verify and have also tested this RC
> > > > > against our internal systems, with no issue.
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > I have raised the issue about Hadoop-lzo, but that is present
> in
> > > the
> > > > > 1.10.1 release also.
> > > > >
> > > > > Andy.
> > > > >
> > > > >
> > > > > On 3/20/19, 7:50 AM, "Zoltan Ivanfi" 
> > > > wrote:
> > > > >
> > > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
> > > > >
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > > signature matches
> > > > > git hash matches the git tag
> > > > > source tarball matches the git tag
> > > > > unit tests and integration tests pass
> > > > >
> > > > > On Tue, Mar 19, 2019 at 3:00 PM Gabor Szadovszky <
> > > > ga...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Dear Parquet Users and Developers,
> > > > > >
> > > > > > I propose the following RC to be released as the official
> > > > Apache
> > > > > > Parquet 1.11.0 release:
> > > > > >
> > > > > > The commit id is 9756b0e2b35437a09716707a81e2ac0c187112ed
> > > > > > * This corresponds to the tag: apache-parquet-1.11.0
> > > > > > *
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F9756b0e2b35437a09716707a81e2ac0c187112eddata=02%7C01%7CAndy.Grove%40rms.com%7Cc45463142cfe401f12b708d6b852dac3%7Cd43fb8a804da4990b86cc4ba9ba4511f%7C0%7C0%7C636899063342858310sdata=v6kHzIIpJQp%2Fq7fuR%2ByHVwGV7vZ7lUKupyqKZwmQeFI%3Dreserved=0
> > > > > >
> > > > > > The release tarball, signature, and checksums are here:
> > > > > > *
> > > > > >
> > > > >
> > > >
> > >
> 

Re: [VOTE] Release Apache Parquet 1.11.0 RC6

2019-04-29 Thread Wes McKinney
-1

Gabor's PGP key is unsigned.

$ gpg --verify apache-parquet-1.11.0.tar.gz.asc
gpg: assuming signed data in 'apache-parquet-1.11.0.tar.gz'
gpg: Signature made Tue 19 Mar 2019 08:55:48 AM CDT
gpg: using RSA key 6FB82970311551C7CEF131F5021057DBF048F543
gpg: Good signature from "Gabor Szadovszky " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 6FB8 2970 3115 51C7 CEF1  31F5 0210 57DB F048 F543

On Tue, Apr 16, 2019 at 4:10 AM Gabor Szadovszky  wrote:
>
> Based on our release process (
> http://parquet.apache.org/documentation/how-to-release/) and the related
> scripts we use the final tag for an RC. So, the existence of this tag does
> not mean 1.11.0 is released.
> However, I agree this is misleading and not a good practice to remove
> already committed tags and re-add them to another place (when a new RC
> comes out). I think, we should update our release process to use RC tags
> and put the final tag only after it is officially released. But it is the
> story of the next release...
>
>
> On Sat, Apr 13, 2019 at 8:00 PM 俊杰陈  wrote:
>
> > From the github release page, I see the 1.11.0 already released. Is it
> > still a rc version?
> > https://github.com/apache/parquet-mr/releases/tag/apache-parquet-1.11.0
> >
> > On Fri, Apr 12, 2019 at 8:10 AM Ryan Blue 
> > wrote:
> >
> > > Personally, I haven't had enough time to devote to Parquet lately and
> > that
> > > means I haven't validated that this release's new features are okay to
> > > release. I'm hoping sometime in the next few weeks I'll be able to vote
> > on
> > > this.
> > >
> > > On Thu, Apr 11, 2019 at 1:23 PM Andy Grove  wrote:
> > >
> > > > I'm curious if there is any update on this vote? The thread seems
> > eerily
> > > > quiet.
> > > >
> > > > Thanks.
> > > >
> > > > On 4/3/19, 10:38 AM, "Andy Grove"  wrote:
> > > >
> > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
> > > >
> > > >
> > > > I have been able to run mvn verify and have also tested this RC
> > > > against our internal systems, with no issue.
> > > >
> > > > +1 (non-binding)
> > > >
> > > > I have raised the issue about Hadoop-lzo, but that is present in
> > the
> > > > 1.10.1 release also.
> > > >
> > > > Andy.
> > > >
> > > >
> > > > On 3/20/19, 7:50 AM, "Zoltan Ivanfi" 
> > > wrote:
> > > >
> > > > CAUTION – UNVERIFIED EXTERNAL EMAIL
> > > >
> > > >
> > > > +1 (binding)
> > > >
> > > > signature matches
> > > > git hash matches the git tag
> > > > source tarball matches the git tag
> > > > unit tests and integration tests pass
> > > >
> > > > On Tue, Mar 19, 2019 at 3:00 PM Gabor Szadovszky <
> > > ga...@apache.org>
> > > > wrote:
> > > >
> > > > > Dear Parquet Users and Developers,
> > > > >
> > > > > I propose the following RC to be released as the official
> > > Apache
> > > > > Parquet 1.11.0 release:
> > > > >
> > > > > The commit id is 9756b0e2b35437a09716707a81e2ac0c187112ed
> > > > > * This corresponds to the tag: apache-parquet-1.11.0
> > > > > *
> > > > >
> > > > >
> > > >
> > >
> > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F9756b0e2b35437a09716707a81e2ac0c187112eddata=02%7C01%7CAndy.Grove%40rms.com%7Cc45463142cfe401f12b708d6b852dac3%7Cd43fb8a804da4990b86cc4ba9ba4511f%7C0%7C0%7C636899063342858310sdata=v6kHzIIpJQp%2Fq7fuR%2ByHVwGV7vZ7lUKupyqKZwmQeFI%3Dreserved=0
> > > > >
> > > > > The release tarball, signature, and checksums are here:
> > > > > *
> > > > >
> > > >
> > >
> > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc6%2Fdata=02%7C01%7CAndy.Grove%40rms.com%7Cc45463142cfe401f12b708d6b852dac3%7Cd43fb8a804da4990b86cc4ba9ba4511f%7C0%7C0%7C636899063342858310sdata=RVlztCju4ZoZz5vnF8f5RxE7kPmZoKMj3Ipo4x0Aj4k%3Dreserved=0
> > > > >
> > > > > You can find the KEYS file here:
> > > > > *
> > > >
> > >
> > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2FKEYSdata=02%7C01%7CAndy.Grove%40rms.com%7Cc45463142cfe401f12b708d6b852dac3%7Cd43fb8a804da4990b86cc4ba9ba4511f%7C0%7C0%7C636899063342858310sdata=8xPAIJ4EkJPXXxZ2hTH%2BuJOtCOrCspYXkjsl%2B44Jb20%3Dreserved=0
> > > > >
> > > > > Binary artifacts are staged in Nexus here:
> > > > > *
> > > > >
> > > > >
> > > >
> > >
> > 

[jira] [Commented] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

2019-04-29 Thread John Adcock (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829371#comment-16829371
 ] 

John Adcock commented on PARQUET-1405:
--

I'm being hit by this issue and I'm happy to try to help resolve it, but I
would appreciate any tips on getting started with diagnosing where the
problem is. Is the column statistics comment above accurate?



Friendly reminder

2019-04-29 Thread Adam Alami
Dear Apache communities,

Big thanks to those who have participated in the survey (great participation
from the Apache communities). If you haven't participated yet, please
consider doing so.

What value is there in participating?

I will be sharing the results with the community in the form of a report 
(slides containing results and analysis). For example, the report will answer 
these questions:
- What does Apache as a community value in PR assessment (e.g. trust,
relationships, technical expertise, etc.)?
- What PR assessment strategy does Apache have in place?
- Does the process in place value the contributor, technical expertise or
social norms? Etc.

Please, participate: https://icse2020.limequery.com/913965?lang=en



Kind regards

Adam Alami
PhD Fellow

IT UNIVERSITY OF COPENHAGEN
Rued Langgaards Vej 7
DK-2300 Copenhagen S
4D04

E-mail: a...@itu.dk
PHONE (+45) 72 18 50 71
www.itu.dk




Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Curt Hagenlocher
Would that be covered by PARQUET-458 (
https://issues.apache.org/jira/browse/PARQUET-458)?

On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney  wrote:

> Is there a JIRA issue about data page v2 issues in parquet-cpp?
>
> On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher 
> wrote:
> >
> > But the data page is decoded only after it is decompressed, so I
> wouldn’t expect an unsupported data page to cause a decompression failure.
> >
> > (I am playing with adding V2 support to Parquet.Net.)
> >
> > Sent from my iPhone
> >
> > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov 
> wrote:
> > >
> > > If you are referring to the file in Apache/parquet-testing repository,
> it
> > > is a valid Parquet file with data encoded into data page v2.
> > >
> > > You can easily test it with “cargo install parquet” and “parquet-read
> > > filepath”.
> > >
> > > I am not sure what kind of code you have written, but the error you
> have
> > > encountered could be related to the fact that parquet-cpp does not
> support
> > > decoding of data page v2.
> > >
> > >
> > > Cheers,
> > >
> > > Ivan
> > >
> > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher  >
> > > wrote:
> > >
> > >> To the best of my ability to tell, there is invalid Snappy data in
> the file
> > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read
> it with
> > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > >>
> > >> Thanks!
> > >> -Curt
> > >>
>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Wes McKinney
Is there a JIRA issue about data page v2 issues in parquet-cpp?

On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher  wrote:
>
> But the data page is decoded only after it is decompressed, so I wouldn’t 
> expect an unsupported data page to cause a decompression failure.
>
> (I am playing with adding V2 support to Parquet.Net.)
>
> Sent from my iPhone
>
> > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov  wrote:
> >
> > If you are referring to the file in Apache/parquet-testing repository, it
> > is a valid Parquet file with data encoded into data page v2.
> >
> > You can easily test it with “cargo install parquet” and “parquet-read
> > filepath”.
> >
> > I am not sure what kind of code you have written, but the error you have
> > encountered could be related to the fact that parquet-cpp does not support
> > decoding of data page v2.
> >
> >
> > Cheers,
> >
> > Ivan
> >
> > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher 
> > wrote:
> >
> >> To the best of my ability to tell, there is invalid Snappy data in the file
> >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
> >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> >>
> >> Thanks!
> >> -Curt
> >>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Curt Hagenlocher
But the data page is decoded only after it is decompressed, so I wouldn’t 
expect an unsupported data page to cause a decompression failure.

(I am playing with adding V2 support to Parquet.Net.)

Sent from my iPhone

> On Apr 29, 2019, at 7:30 AM, Ivan Sadikov  wrote:
> 
> If you are referring to the file in Apache/parquet-testing repository, it
> is a valid Parquet file with data encoded into data page v2.
> 
> You can easily test it with “cargo install parquet” and “parquet-read
> filepath”.
> 
> I am not sure what kind of code you have written, but the error you have
> encountered could be related to the fact that parquet-cpp does not support
> decoding of data page v2.
> 
> 
> Cheers,
> 
> Ivan
> 
> On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher 
> wrote:
> 
>> To the best of my ability to tell, there is invalid Snappy data in the file
>> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
>> my own code nor with pyarrow 0.13.0. Is this expected to work?
>> 
>> Thanks!
>> -Curt
>> 


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
If you are referring to the file in Apache/parquet-testing repository, it
is a valid Parquet file with data encoded into data page v2.

You can easily test it with “cargo install parquet” and “parquet-read
filepath”.

I am not sure what kind of code you have written, but the error you have
encountered could be related to the fact that parquet-cpp does not support
decoding of data page v2.


Cheers,

Ivan

On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher 
wrote:

> To the best of my ability to tell, there is invalid Snappy data in the file
> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
> my own code nor with pyarrow 0.13.0. Is this expected to work?
>
> Thanks!
> -Curt
>


Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Curt Hagenlocher
To the best of my ability to tell, there is invalid Snappy data in the file
parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
my own code nor with pyarrow 0.13.0. Is this expected to work?

Thanks!
-Curt