Re: Java API that matches C++ low-level write_batch

2020-05-06 Thread Brian Bowman
To clarify … is there a public Parquet Java API that includes writing 
Definition Levels like the C++ low-level API provides?

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ColumnWriter.java
 appears to be an internal API.  Are Definition Levels exposed in a public API, 
and are there Java examples that use Definition Levels?

We have a Java use case that needs to write Definition Levels.

Thanks,

Brian

From: Brian Bowman 
Date: Wednesday, May 6, 2020 at 9:52 AM
To: "dev@parquet.apache.org" 
Cc: Karl Moss , Paul Tomas 
Subject: Java API that matches C++ low-level write_batch

Here’s some Parquet low-level API C++ code used to write a batch of IEEE 
doubles in a RowGroup.  Is there a public Java API equivalent for writing 
Parquet files?

// Append a RowGroup with a specific number of rows.
parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();

// ... (intervening setup elided) ...

auto *double_writer =
    static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());

double_writer->WriteBatch(num_rows, defLevels, nullptr, (double *) double_rows);
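
For context, here is a hedged, minimal sketch of the low-level C++ write flow around that snippet (modeled on the parquet-cpp reader-writer example; exact signatures such as FileOutputStream::Open vary between Arrow/Parquet releases, and the file, schema, and value names here are illustrative):

#include <arrow/io/file.h>
#include <parquet/api/writer.h>

#include <cstdint>
#include <memory>
#include <vector>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // One OPTIONAL double column => max_definition_level == 1 (0 = NULL, 1 = value).
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("d", parquet::Repetition::OPTIONAL,
                                       parquet::Type::DOUBLE));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  auto sink = arrow::io::FileOutputStream::Open("doubles.parquet").ValueOrDie();
  auto file_writer = parquet::ParquetFileWriter::Open(sink, schema);

  // Five rows with one NULL; only the non-null values go in the value buffer.
  std::vector<int16_t> def_levels = {1, 1, 1, 0, 1};
  std::vector<double>  values     = {1.0, 2.0, 3.0, 5.0};

  parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();
  auto *double_writer =
      static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());
  // num_values counts definition levels (rows), not just the non-null values.
  double_writer->WriteBatch(static_cast<int64_t>(def_levels.size()),
                            def_levels.data(), /*rep_levels=*/nullptr,
                            values.data());

  file_writer->Close();
  return 0;
}

With a single OPTIONAL column the max definition level is 1, so a 0 in def_levels marks a NULL row and no entry for that row appears in the values buffer.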

Thanks,

Brian




Java API that matches C++ low-level write_batch

2020-05-06 Thread Brian Bowman
Here’s some Parquet low-level API C++ code used to write a batch of IEEE 
doubles in a RowGroup.  Is there a public Java API equivalent for writing 
Parquet files?

// Append a RowGroup with a specific number of rows.
parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();

// ... (intervening setup elided) ...

auto *double_writer =
    static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());

double_writer->WriteBatch(num_rows, defLevels, nullptr, (double *) double_rows);

Thanks,

Brian




Re: Java/Go APIs - Writing Definition Level Info

2020-04-13 Thread Brian Bowman
Forgive the interruption. I found the Java and Go APIs/code.


From: Brian Bowman 
Sent: Saturday, April 11, 2020 4:33:32 PM
To: dev@parquet.apache.org 
Cc: Karl Moss ; Jason Secosky 
Subject: Java/Go APIs - Writing Definition Level Info


Can someone familiar with the Java/Go Parquet code please reply with API code 
links/doc for writing/reading column data and corresponding definition levels.  
 We are doing this with the Parquet C++ low-level APIs as a means to represent 
NULL values (max_definition_level = 1 where 0 is NULL/1 is a value) and 
exploring the same possibility with Java and Go.



Thanks,



Brian


Java/Go APIs - Writing Definition Level Info

2020-04-11 Thread Brian Bowman
Can someone familiar with the Java/Go Parquet code please reply with API code 
links/doc for writing/reading column data and corresponding definition levels.  
 We are doing this with the Parquet C++ low-level APIs as a means to represent 
NULL values (max_definition_level = 1 where 0 is NULL/1 is a value) and 
exploring the same possibility with Java and Go.

Thanks,

Brian


Re: Dictionary Decoding for BYTE_ARRAY types

2019-10-12 Thread Brian Bowman
Thanks Wes,

I'm getting the per-Row Group MAX/MIN BYTE_ARRAY values back.   Is the maximum 
value length for each BYTE_ARRAY column also stored?

For example, the following BYTE_ARRAY column with three "Canadian Province" 
values:

MIN = "Alberta"

   "British Columbia"

MAX ="Saskatchewan"

"British Columbia" is the longest value (16) though it's not a MIN/MAX value.  
Is this maximum length (e.g. 16 in this example) of each BYTE_ARRAY column 
value stored in any Parquet column-scoped metadata?

Thanks,


Brian

On 9/12/19, 6:10 PM, "Wes McKinney"  wrote:

EXTERNAL

See 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120
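
Following that pointer, a minimal sketch (hedged: it assumes the parquet-cpp metadata/statistics API; the file name and column index are illustrative) of pulling per-Row Group BYTE_ARRAY min/max statistics.  Note that the standard column-chunk Statistics carry min/max values plus null and distinct counts; a per-column maximum value length is not among them:

#include <parquet/api/reader.h>
#include <parquet/statistics.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("provinces.parquet");
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();

  const int col = 0;  // index of a BYTE_ARRAY column (illustrative)
  for (int r = 0; r < md->num_row_groups(); ++r) {
    auto rg_md = md->RowGroup(r);
    auto cc_md = rg_md->ColumnChunk(col);
    if (cc_md->type() != parquet::Type::BYTE_ARRAY || !cc_md->is_stats_set()) {
      continue;
    }
    auto stats = std::static_pointer_cast<parquet::ByteArrayStatistics>(
        cc_md->statistics());
    if (!stats->HasMinMax()) continue;

    // ByteArray is a (len, ptr) pair; copy into std::string for printing.
    std::string min_val(reinterpret_cast<const char*>(stats->min().ptr),
                        stats->min().len);
    std::string max_val(reinterpret_cast<const char*>(stats->max().ptr),
                        stats->max().len);
    std::cout << "row group " << r << ": min=" << min_val
              << "  max=" << max_val << std::endl;
  }
  return 0;
}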

On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman  wrote:
>
> Thanks Wes,
>
> With that in mind, I’m searching for a public API that returns MAX length 
value for ByteArray columns.  Can you point me to an example?
>
> -Brian
>
> On 9/12/19, 5:34 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> The memory references returned by ReadBatch are not guaranteed to
> persist from one function call to the next. So you need to copy the
> ByteArray data into your own data structures before calling ReadBatch
> again.
>
> Column readers for different columns are independent from each other.
> So function calls for column 7 should not affect anything having to do
> with column 4.
>
> On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman  
wrote:
> >
> > All,
> >
> > I’m debugging a low-level API Parquet reader case where the table 
has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
> >
> > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.
> >
> > In the following ReadBatch(), rowsToRead is already set to all rows 
in the Row Group.  The quantity is verified by the return value in values_read.
> >
> > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
> >
> > Column 4 is dictionary encoded.  Upon return from its ReadBatch() call,
> > the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct
> > len/ptr pairs pointing into a decoded dictionary string – although not from
> > the original dictionary values in the .parquet file being read.
> >
> > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column
> > (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values
> > for column 4 are trashed.
> >
> > Is this expected behavior or a bug?  If expected, then it seems the 
dictionary values for Column 4 (… or any BYTE_ARRAY column that is 
dictionary-compressed) should be copied and the descriptor vector addresses 
back-patched, BEFORE invoking ReadBatch() again.  Is this the case?
> >
> > Thanks for clarifying,
> >
> >
> > -Brian
> >
> >
> >
> >
>
>




Re: Dictionary Decoding for BYTE_ARRAY types

2019-09-12 Thread Brian Bowman
Thanks Wes,

With that in mind, I’m searching for a public API that returns MAX length value 
for ByteArray columns.  Can you point me to an example?

-Brian

On 9/12/19, 5:34 PM, "Wes McKinney"  wrote:

EXTERNAL

The memory references returned by ReadBatch are not guaranteed to
persist from one function call to the next. So you need to copy the
ByteArray data into your own data structures before calling ReadBatch
again.

Column readers for different columns are independent from each other.
So function calls for column 7 should not affect anything having to do
with column 4.
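
To make the copy requirement concrete, a minimal sketch (hedged: it assumes the parquet-cpp low-level reader API; the helper name and batch size are illustrative) of deep-copying the BYTE_ARRAY results into owned strings before ReadBatch is called again:

#include <parquet/api/reader.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Read an entire BYTE_ARRAY column, copying each value out of the
// reader-owned (dictionary/page) buffers before ReadBatch is called again.
std::vector<std::string> ReadByteArrayColumn(parquet::RowGroupReader* rg_reader,
                                             int col, int64_t batch_size = 1024) {
  auto reader = std::static_pointer_cast<parquet::ByteArrayReader>(
      rg_reader->Column(col));

  std::vector<parquet::ByteArray> values(batch_size);
  std::vector<int16_t> def_levels(batch_size);
  std::vector<std::string> owned;

  while (reader->HasNext()) {
    int64_t values_read = 0;
    reader->ReadBatch(batch_size, def_levels.data(), /*rep_levels=*/nullptr,
                      values.data(), &values_read);
    for (int64_t i = 0; i < values_read; ++i) {
      // Copy now: the (ptr, len) pairs may point into buffers that are
      // replaced on the next ReadBatch / dictionary page read.
      owned.emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                         values[i].len);
    }
  }
  return owned;
}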

On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman  wrote:
>
> All,
>
> I’m debugging a low-level API Parquet reader case where the table has 
DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
>
> Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.
>
> In the following ReadBatch(), rowsToRead is already set to all rows in 
the Row Group.  The quantity is verified by the return value in values_read.
>
> byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
>
> Column 4 is dictionary encoded.  Upon return from its ReadBatch() call,
> the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr
> pairs pointing into a decoded dictionary string – although not from the
> original dictionary values in the .parquet file being read.
>
> As soon as the ReadBatch() call is made for the next BYTE_ARRAY column
> (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values
> for column 4 are trashed.
>
> Is this expected behavior or a bug?  If expected, then it seems the 
dictionary values for Column 4 (… or any BYTE_ARRAY column that is 
dictionary-compressed) should be copied and the descriptor vector addresses 
back-patched, BEFORE invoking ReadBatch() again.  Is this the case?
>
> Thanks for clarifying,
>
>
> -Brian
>
>
>
>




Dictionary Decoding for BYTE_ARRAY types

2019-09-12 Thread Brian Bowman
All,

I’m debugging a low-level API Parquet reader case where the table has DOUBLE, 
BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.

Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.

In the following ReadBatch(), rowsToRead is already set to all rows in the Row 
Group.  The quantity is verified by the return value in values_read.

  
byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);

Column 4 is dictionary encoded.  Upon return from its ReadBatch() call, the 
result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs 
pointing into a decoded dictionary string – although not from the original 
dictionary values in the .parquet file being read.

As soon as the ReadBatch() call is made for the next BYTE_ARRAY column 
(#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for 
column 4 are trashed.

Is this expected behavior or a bug?  If expected, then it seems the dictionary 
values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) 
should be copied and the descriptor vector addresses back-patched, BEFORE 
invoking ReadBatch() again.  Is this the case?

Thanks for clarifying,


-Brian






Re: Workaround for Thrift download ERRORs

2019-07-17 Thread Brian Bowman
This turns out to be a problem with cert validation when cmake sets up to 
download thrift, due to back-level RHEL 6 and Python 2.6.6 on the internal SAS 
build node where this fails.  
We need to update to a newer supported version of Python on these machines.

 https://issues.apache.org/jira/ seems to be down right now.  I'll add a 
comment about this when it comes back.  

-Brian


On 7/15/19, 3:45 PM, "Brian Bowman"  wrote:

EXTERNAL

Wes,

I can reproduce this issue with a post-0.14.0 git-cloned repo on the same 
Linux system where I see it with the 0.14.0 release download.  The successful 
cmake/make occurs on a Linux VM where I personally staged all of the build 
tools: built gcc, cmake 3.2+, etc.  This looks like a problem with my build 
environments here at SAS and NOT with the apache arrow/parquet code base.

Just updated https://issues.apache.org/jira/browse/ARROW-5953 with this 
info.   I'll keep this JIRA open for now in case there is something new to 
report that can help other novices (like me) in building arrow/parquet.

Forgive the noise.

-Brian


On 7/15/19, 2:54 PM, "Wes McKinney"  wrote:

EXTERNAL

Sorry, that's not right. That commit was included in 0.14.0. So I'm
not sure what code change could cause what you're seeing

On Mon, Jul 15, 2019 at 1:53 PM Wes McKinney  
wrote:
>
> It might have gotten fixed in
>
> 
https://github.com/apache/arrow/commit/3a37bf29c512b4c72c8da5b2a8657b21548cc47a#diff-d7849d7fb46f0cd405cfe5fd03828fcd
>
    > On Mon, Jul 15, 2019 at 1:50 PM Brian Bowman  
wrote:
> >
> > See:  https://issues.apache.org/jira/browse/ARROW-5953
> >
> > I'll keep digging.
> >
> > -Brian
> >
> > On 7/15/19, 2:45 PM, "Wes McKinney"  wrote:
> >
> > EXTERNAL
> >
> > Brian -- I am concerned the issue is non-deterministic and 
relates to
> > the get_apache_mirror.py script. There may be something we can 
do to
> > make that script more robust, e.g. adding retry logic or some 
fallback
> > to a known mirror. Of course, if you can consistently reproduce 
the
    > > issue that is good to know too
> >
> > - Wes
> >
> > On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman 
 wrote:
> > >
> > > Wes,
> > >
> > > Here are the cmake thrift log lines from a build of 
apache-arrow git clone on 06Jul2019 where cmake successfully downloads thrift.
> > >
> > > -- Checking for module 'thrift'
> > > -- No package 'thrift' found
> > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB)
> > > Building Apache Thrift from source
> > > Downloading Apache Thrift from 
http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz
> > >
> > > Do you still want a JIRA issue entered, given that this git 
clone works and is a bit newer than the arrow-0.14.0 release .tar?
> > >
> > > - Brian
> > >
> > >
> > > On 7/15/19, 12:39 PM, "Wes McKinney"  
wrote:
> > >
> > > EXTERNAL
> > >
> > > hi Brian,
> > >
> > > Can you please open a JIRA issue?
> > >
> > > Does running the "get_apache_mirror.py" script work for 
you by itself?
> > >
> > > $ python cpp/build-support/get_apache_mirror.py
> > > https://www-eu.apache.org/dist/
> > >
> > > - Wes
> > >
> > > On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman 
 wrote:
> > > >
> > > > Is there a workaround for the following error?
> > > >
> > > > requests.exceptions.SSLError: hostname 'www.apache.org' 
doesn't match either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> > > >
> > > > I’ve inflated apache-arrow-0.14.0.tar and the 
thrift-0.12.0.tar.gz is not being found during cmake.  This results in 
downstream compile errors during make.
> > > >

Re: Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
Wes,

I can reproduce this issue with a post-0.14.0 git-cloned repo on the same 
Linux system where I see it with the 0.14.0 release download.  The successful 
cmake/make occurs on a Linux VM where I personally staged all of the build 
tools: built gcc, cmake 3.2+, etc.  This looks like a problem with my build 
environments here at SAS and NOT with the apache arrow/parquet code base.

Just updated https://issues.apache.org/jira/browse/ARROW-5953 with this info.   
I'll keep this JIRA open for now in case there is something new to report that 
can help other novices (like me) in building arrow/parquet.

Forgive the noise.

-Brian


On 7/15/19, 2:54 PM, "Wes McKinney"  wrote:

EXTERNAL

Sorry, that's not right. That commit was included in 0.14.0. So I'm
not sure what code change could cause what you're seeing

On Mon, Jul 15, 2019 at 1:53 PM Wes McKinney  wrote:
>
> It might have gotten fixed in
>
> 
https://github.com/apache/arrow/commit/3a37bf29c512b4c72c8da5b2a8657b21548cc47a#diff-d7849d7fb46f0cd405cfe5fd03828fcd
>
> On Mon, Jul 15, 2019 at 1:50 PM Brian Bowman  wrote:
> >
> > See:  https://issues.apache.org/jira/browse/ARROW-5953
> >
> > I'll keep digging.
> >
> > -Brian
> >
> > On 7/15/19, 2:45 PM, "Wes McKinney"  wrote:
> >
> > EXTERNAL
> >
> > Brian -- I am concerned the issue is non-deterministic and relates 
to
> > the get_apache_mirror.py script. There may be something we can do to
> > make that script more robust, e.g. adding retry logic or some 
fallback
> > to a known mirror. Of course, if you can consistently reproduce the
> > issue that is good to know too
> >
> > - Wes
> >
> > On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman  
wrote:
> > >
> > > Wes,
> > >
> > > Here are the cmake thrift log lines from a build of apache-arrow 
git clone on 06Jul2019 where cmake successfully downloads thrift.
> > >
> > > -- Checking for module 'thrift'
> > > -- No package 'thrift' found
> > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB)
> > > Building Apache Thrift from source
> > > Downloading Apache Thrift from 
http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz
> > >
> > > Do you still want a JIRA issue entered, given that this git clone 
works and is a bit newer than the arrow-0.14.0 release .tar?
> > >
> > > - Brian
> > >
> > >
> > > On 7/15/19, 12:39 PM, "Wes McKinney"  wrote:
> > >
> > > EXTERNAL
> > >
> > > hi Brian,
> > >
> > > Can you please open a JIRA issue?
> > >
> > > Does running the "get_apache_mirror.py" script work for you 
by itself?
> > >
> > > $ python cpp/build-support/get_apache_mirror.py
> > > https://www-eu.apache.org/dist/
> > >
> > > - Wes
> > >
> > > On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman 
 wrote:
> > > >
> > > > Is there a workaround for the following error?
> > > >
> > > > requests.exceptions.SSLError: hostname 'www.apache.org' 
doesn't match either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> > > >
> > > > I’ve inflated apache-arrow-0.14.0.tar and the 
thrift-0.12.0.tar.gz is not being found during cmake.  This results in 
downstream compile errors during make.
> > > >
> > > > Here’s the log info from cmake:
> > > >
> > > > -- Checking for module 'thrift'
> > > > --   No package 'thrift' found
> > > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB 
THRIFT_INCLUDE_DIR THRIFT_COMPILER)
> > > > Building Apache Thrift from source
> > > > Downloading Apache Thrift from Traceback (most recent call 
last):
> > > >   File 
"…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in 

> > > > suggested_mirror = get_url('https://www.apache.org/dyn/'
> > > >   File 
"…/apache-arrow-0.14

Re: Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
See:  https://issues.apache.org/jira/browse/ARROW-5953

I'll keep digging.

-Brian

On 7/15/19, 2:45 PM, "Wes McKinney"  wrote:

EXTERNAL

Brian -- I am concerned the issue is non-deterministic and relates to
the get_apache_mirror.py script. There may be something we can do to
make that script more robust, e.g. adding retry logic or some fallback
to a known mirror. Of course, if you can consistently reproduce the
issue that is good to know too

- Wes

On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman  wrote:
>
> Wes,
>
> Here are the cmake thrift log lines from a build of apache-arrow git 
clone on 06Jul2019 where cmake successfully downloads thrift.
>
> -- Checking for module 'thrift'
> -- No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB)
> Building Apache Thrift from source
> Downloading Apache Thrift from 
http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz
>
> Do you still want a JIRA issue entered, given that this git clone works 
and is a bit newer than the arrow-0.14.0 release .tar?
>
> - Brian
>
>
> On 7/15/19, 12:39 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Can you please open a JIRA issue?
>
> Does running the "get_apache_mirror.py" script work for you by itself?
>
> $ python cpp/build-support/get_apache_mirror.py
    > https://www-eu.apache.org/dist/
>
> - Wes
>
> On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman  
wrote:
> >
> > Is there a workaround for the following error?
> >
> > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't 
match either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> >
> > I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz 
is not being found during cmake.  This results in downstream compile errors 
during make.
> >
> > Here’s the log info from cmake:
> >
> > -- Checking for module 'thrift'
> > --   No package 'thrift' found
> > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB 
THRIFT_INCLUDE_DIR THRIFT_COMPILER)
> > Building Apache Thrift from source
> > Downloading Apache Thrift from Traceback (most recent call last):
> >   File 
"…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in 

> > suggested_mirror = get_url('https://www.apache.org/dyn/'
> >   File 
"…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, in 
get_url
> > return requests.get(url).content
> >   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, 
in get
> > return request('get', url, **kwargs)
> >   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, 
in request
> > response = session.request(method=method, url=url, **kwargs)
> >   File "/usr/lib/python2.6/site-packages/requests/sessions.py", 
line 464, in request
> > resp = self.send(prep, **send_kwargs)
> >   File "/usr/lib/python2.6/site-packages/requests/sessions.py", 
line 576, in send
> > r = adapter.send(request, **kwargs)
> >   File "/usr/lib/python2.6/site-packages/requests/adapters.py", 
line 431, in send
> > raise SSLError(e, request=request)
> > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't 
match either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> >
> >
> > Thanks,
> >
> >
> > Brian
> >
>
>




Re: Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
Wes,

Here are the cmake thrift log lines from a build of apache-arrow git clone on 
06Jul2019 where cmake successfully downloads thrift. 
 
-- Checking for module 'thrift'
-- No package 'thrift' found
-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) 
Building Apache Thrift from source
Downloading Apache Thrift from 
http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz

Do you still want a JIRA issue entered, given that this git clone works and is 
a bit newer than the arrow-0.14.0 release .tar?

- Brian


On 7/15/19, 12:39 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Can you please open a JIRA issue?

Does running the "get_apache_mirror.py" script work for you by itself?

$ python cpp/build-support/get_apache_mirror.py
https://www-eu.apache.org/dist/

- Wes

On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman  wrote:
>
> Is there a workaround for the following error?
>
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match 
either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
>
> I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not 
being found during cmake.  This results in downstream compile errors during 
make.
>
> Here’s the log info from cmake:
>
> -- Checking for module 'thrift'
> --   No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
THRIFT_COMPILER)
> Building Apache Thrift from source
> Downloading Apache Thrift from Traceback (most recent call last):
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", 
line 38, in <module>
> suggested_mirror = get_url('https://www.apache.org/dyn/'
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", 
line 27, in get_url
> return requests.get(url).content
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
> return request('get', url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in 
request
> response = session.request(method=method, url=url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, 
in request
> resp = self.send(prep, **send_kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, 
in send
> r = adapter.send(request, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, 
in send
> raise SSLError(e, request=request)
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match 
either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
>
>
> Thanks,
>
>
> Brian
>




Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
Is there a workaround for the following error?

requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of 
'*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz

I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being 
found during cmake.  This results in downstream compile errors during make.

Here’s the log info from cmake:

-- Checking for module 'thrift'
--   No package 'thrift' found
-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
THRIFT_COMPILER)
Building Apache Thrift from source
Downloading Apache Thrift from Traceback (most recent call last):
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, 
in <module>
suggested_mirror = get_url('https://www.apache.org/dyn/'
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, 
in get_url
return requests.get(url).content
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
return request('get', url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in 
request
resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in 
send
r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in 
send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of 
'*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz


Thanks,


Brian



Need Help With Runtime ERROR in parquet::ColumnReader [C++]

2019-06-25 Thread Brian Bowman
Have any of you seen the following?   It occurs intermittently in my test 
cases.  FWIW - this is the only reader thread currently invoking Parquet code.

#0  std::move&> 
(__t=...) at /usr/local/include/c++/7.3.0/bits/move.h:98
#1  0x7f94c86eb8bc in std::__shared_ptr::operator=(std::__shared_ptr&&) (this=0x7f94a0007968,
__r=)  at 
/usr/local/include/c++/7.3.0/bits/shared_ptr_base.h:1213
#2  0x7f94c86e806a in 
std::shared_ptr::operator=(std::shared_ptr&&) 
(this=0x7f94a0007968, __r=)
at /usr/local/include/c++/7.3.0/bits/shared_ptr.h:319
#3  0x7f94c8747af6 in 
parquet::TypedColumnReader 
>::ReadNewPage (this=0x7f94a0007950) at . . . 
/Arrow/cpp/src/parquet/column_reader.cc:324
#4  0x7f94c873e04f in parquet::ColumnReader::HasNext (this=0x7f94a0007950) 
at . . . /Arrow/cpp/src/parquet/column_reader.h:121
#5  0x7f94c87472df in 
parquet::TypedColumnReader 
>::ReadBatch (this=0x7f94a0007950, batch_size=5,  def_levels=0x7f94a52535f8, 
rep_levels=0x0, values=0x7f94a5253020, values_read=0x7f94afb419d8)
   at . . . /Arrow/cpp/src/parquet/column_reader.h:347


Thanks,


Brian


Re: Arrow/Parquet make failing in the Thrift EP

2019-05-31 Thread Brian Bowman
Removing Thrift 0.11 solved the problem.

Thanks Wes!

-Brian

On 5/30/19, 4:47 PM, "Wes McKinney"  wrote:

EXTERNAL

OK, it looks like you have Thrift 0.11 installed somewhere on your
system which is wreaking havoc

 cat 
../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log
make[5]: *** No rule to make target
`/usr/lib/x86_64-linux-gnu/libssl.so', needed by
`lib/libthriftd.so.0.11.0'.  Stop.

Our thirdparty build should be building Thrift 0.12

https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt#L47

On Thu, May 30, 2019 at 3:40 PM Brian Bowman  wrote:
>
> Thanks Wes,
>
> openssl-devel is installed.
>
> yum info openssl-devel
> Loaded plugins: langpacks, product-id, search-disabled-repos, 
subscription-manager
> Installed Packages
> Name: openssl-devel
> Arch: x86_64
> Epoch   : 1
> Version : 1.0.2k
> Release : 8.el7
> Size: 3.1 M
> Repo: installed
> From repo   : anaconda
>  . . .
>
> Available Packages
> Name: openssl-devel
> Arch: i686
> Epoch   : 1
> Version : 1.0.2k
> Release : 8.el7
> Size: 1.5 M
> Repo: RHEL74
>  . . .
>
> I also downloaded and successfully built Thrift.
>
> ll /usr/local/lib/libth*
> -rwxr-xr-x 1 root root  8394200 May 30 14:03 
/usr/local/lib/libthrift-0.12.0.so
> -rw-r--r-- 1 root root 20996002 May 30 14:03 /usr/local/lib/libthrift.a
> -rw-r--r-- 1 root root  1224764 May 30 14:05 
/usr/local/lib/libthrift_c_glib.a
> -rwxr-xr-x 1 root root 1049 May 30 14:05 
/usr/local/lib/libthrift_c_glib.la
> lrwxrwxrwx 1 root root   25 May 30 14:05 
/usr/local/lib/libthrift_c_glib.so -> libthrift_c_glib.so.0.0.0
> lrwxrwxrwx 1 root root   25 May 30 14:05 
/usr/local/lib/libthrift_c_glib.so.0 -> libthrift_c_glib.so.0.0.0
> -rwxr-xr-x 1 root root   669136 May 30 14:05 
/usr/local/lib/libthrift_c_glib.so.0.0.0
> -rwxr-xr-x 1 root root  999 May 30 14:03 /usr/local/lib/libthrift.la
> -rwxr-xr-x 1 root root   716944 May 30 14:03 
/usr/local/lib/libthriftqt-0.12.0.so
> -rw-r--r-- 1 root root  1580088 May 30 14:03 /usr/local/lib/libthriftqt.a
> -rwxr-xr-x 1 root root 1019 May 30 14:03 /usr/local/lib/libthriftqt.la
> lrwxrwxrwx 1 root root   21 May 30 14:03 
/usr/local/lib/libthriftqt.so -> libthriftqt-0.12.0.so
> lrwxrwxrwx 1 root root   19 May 30 14:03 /usr/local/lib/libthrift.so 
-> libthrift-0.12.0.so
> -rwxr-xr-x 1 root root  1175424 May 30 14:03 
/usr/local/lib/libthriftz-0.12.0.so
> -rw-r--r-- 1 root root  2631910 May 30 14:03 /usr/local/lib/libthriftz.a
> -rwxr-xr-x 1 root root  995 May 30 14:03 /usr/local/lib/libthriftz.la
> lrwxrwxrwx 1 root root   20 May 30 14:03 /usr/local/lib/libthriftz.so 
-> libthriftz-0.12.0.so
>
> I'll keep digging.
>
>
> -Brian
>
>
> On 5/30/19, 4:32 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Is openssl-devel installed on this system? We don't have any
> OpenSSL-specific code for the Thrift EP build in the Arrow build
> system so you might try to see if you can build Thrift directly from
> source on the system to see if the problem persists
>
> It appears that Thrift tries to build against OpenSSL if CMake can
> detect it on the system
>
> 
https://github.com/apache/thrift/blob/master/build/cmake/DefineOptions.cmake#L76
>
> - Wes
>
> On Thu, May 30, 2019 at 2:41 PM Brian Bowman  
wrote:
> >
> > Just started seeing the following ERROR when compiling 
Arrow/Parquet cpp after restaging my dev environment in RHEL 7.4.
> >
> > Is something incorrect in my CMake setup such that the installed 
> > libssl.so required by thrift cannot be found?
> >
> >  cat 
../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log
> > make[5]: *** No rule to make target 
`/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'.  
Stop.
> >
> > Any ideas what’s going on?
> >
> > -Brian
>
>




Re: Arrow/Parquet make failing in the Thrift EP

2019-05-30 Thread Brian Bowman
Thanks Wes,

openssl-devel is installed.

yum info openssl-devel
Loaded plugins: langpacks, product-id, search-disabled-repos, 
subscription-manager
Installed Packages
Name: openssl-devel
Arch: x86_64
Epoch   : 1
Version : 1.0.2k
Release : 8.el7
Size: 3.1 M
Repo: installed
From repo   : anaconda
 . . .

Available Packages
Name: openssl-devel
Arch: i686
Epoch   : 1
Version : 1.0.2k
Release : 8.el7
Size: 1.5 M
Repo: RHEL74
 . . .

I also downloaded and successfully built Thrift.

ll /usr/local/lib/libth*
-rwxr-xr-x 1 root root  8394200 May 30 14:03 /usr/local/lib/libthrift-0.12.0.so
-rw-r--r-- 1 root root 20996002 May 30 14:03 /usr/local/lib/libthrift.a
-rw-r--r-- 1 root root  1224764 May 30 14:05 /usr/local/lib/libthrift_c_glib.a
-rwxr-xr-x 1 root root 1049 May 30 14:05 /usr/local/lib/libthrift_c_glib.la
lrwxrwxrwx 1 root root   25 May 30 14:05 /usr/local/lib/libthrift_c_glib.so 
-> libthrift_c_glib.so.0.0.0
lrwxrwxrwx 1 root root   25 May 30 14:05 
/usr/local/lib/libthrift_c_glib.so.0 -> libthrift_c_glib.so.0.0.0
-rwxr-xr-x 1 root root   669136 May 30 14:05 
/usr/local/lib/libthrift_c_glib.so.0.0.0
-rwxr-xr-x 1 root root  999 May 30 14:03 /usr/local/lib/libthrift.la
-rwxr-xr-x 1 root root   716944 May 30 14:03 
/usr/local/lib/libthriftqt-0.12.0.so
-rw-r--r-- 1 root root  1580088 May 30 14:03 /usr/local/lib/libthriftqt.a
-rwxr-xr-x 1 root root 1019 May 30 14:03 /usr/local/lib/libthriftqt.la
lrwxrwxrwx 1 root root   21 May 30 14:03 /usr/local/lib/libthriftqt.so -> 
libthriftqt-0.12.0.so
lrwxrwxrwx 1 root root   19 May 30 14:03 /usr/local/lib/libthrift.so -> 
libthrift-0.12.0.so
-rwxr-xr-x 1 root root  1175424 May 30 14:03 /usr/local/lib/libthriftz-0.12.0.so
-rw-r--r-- 1 root root  2631910 May 30 14:03 /usr/local/lib/libthriftz.a
-rwxr-xr-x 1 root root  995 May 30 14:03 /usr/local/lib/libthriftz.la
lrwxrwxrwx 1 root root   20 May 30 14:03 /usr/local/lib/libthriftz.so -> 
libthriftz-0.12.0.so

I'll keep digging.


-Brian


On 5/30/19, 4:32 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Is openssl-devel installed on this system? We don't have any
OpenSSL-specific code for the Thrift EP build in the Arrow build
system so you might try to see if you can build Thrift directly from
source on the system to see if the problem persists

It appears that Thrift tries to build against OpenSSL if CMake can
detect it on the system


https://github.com/apache/thrift/blob/master/build/cmake/DefineOptions.cmake#L76

- Wes

On Thu, May 30, 2019 at 2:41 PM Brian Bowman  wrote:
>
> Just started seeing the following ERROR when compiling Arrow/Parquet cpp 
after restaging my dev environment in RHEL 7.4.
>
> Is something incorrect in my CMake setup such that the installed libssl.so 
> required by thrift cannot be found?
>
>  cat 
../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log
> make[5]: *** No rule to make target 
`/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'.  
Stop.
>
> Any ideas what’s going on?
>
> -Brian




Arrow/Parquet make failing in the Thrift EP

2019-05-30 Thread Brian Bowman
Just started seeing the following ERROR when compiling Arrow/Parquet cpp after 
restaging my dev environment in RHEL 7.4.

Is something incorrect in my CMake setup such that the installed libssl.so required 
by thrift cannot be found?

 cat 
../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log
make[5]: *** No rule to make target `/usr/lib/x86_64-linux-gnu/libssl.so', 
needed by `lib/libthriftd.so.0.11.0'.  Stop.

Any ideas what’s going on?

-Brian


Re: Parquet File Naming Convention Standards

2019-05-22 Thread Brian Bowman
 Thanks for the info!

HDFS is only one of many storage platforms (distributed or otherwise) that SAS 
supports.  In general, larger physical files (e.g. 100MB to 1GB) with multiple 
RowGroups are also a good thing for our use cases.  I'm working to get our 
Parquet (C to C++ via libparquet.so) writer to do this.

-Brian

On 5/22/19, 1:21 PM, "Lee, David"  wrote:

EXTERNAL

I'm not a big fan of this convention, which is a Spark convention.

A. The files should have at least "foo" in the name. Using PyArrow I would 
create these files as foo.1.parquet, foo.2.parquet, etc..
B. These files are around 3 megs each. For HDFS storage, files should be 
sized to match the HDFS blocksize which is usually set at 128 megs (default) or 
256 megs, 512 megs, 1 gig, etc..

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

I usually take small parquet files and save them as parquet row groups in a 
larger parquet file to match the HDFS blocksize.

-Original Message-----
    From: Brian Bowman 
Sent: Wednesday, May 22, 2019 8:40 AM
To: dev@parquet.apache.org
Subject: Parquet File Naming Convention Standards

External Email: Use caution with links and attachments


All,

Here is an example .parquet data set saved using pySpark where the 
following files are members of directory: “foo.parquet”:

-rw-r--r--1 sasbpb  r8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--1 sasbpb  r25632 Mar 26 12:10 
.part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r25356 Mar 26 12:10 
.part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r26300 Mar 26 12:10 
.part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r23728 Mar 26 12:10 
.part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r0 Mar 26 12:10 _SUCCESS
-rw-r--r--1 sasbpb  r  3279617 Mar 26 12:10 
part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3244105 Mar 26 12:10 
part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3365039 Mar 26 12:10 
part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3035960 Mar 26 12:10 
part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet


Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that "b84abe50-a92b-4b2b-b011-30990891fb83" is a UUID.  Is 
the format:
 part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an 
established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?


Thanks,


Brian






Parquet File Naming Convention Standards

2019-05-22 Thread Brian Bowman
All,

Here is an example .parquet data set saved using pySpark where the following 
files are members of directory: “foo.parquet”:

-rw-r--r--1 sasbpb  r8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--1 sasbpb  r25632 Mar 26 12:10 
.part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r25356 Mar 26 12:10 
.part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r26300 Mar 26 12:10 
.part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r23728 Mar 26 12:10 
.part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r0 Mar 26 12:10 _SUCCESS
-rw-r--r--1 sasbpb  r  3279617 Mar 26 12:10 
part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3244105 Mar 26 12:10 
part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3365039 Mar 26 12:10 
part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3035960 Mar 26 12:10 
part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet


Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that "b84abe50-a92b-4b2b-b011-30990891fb83" is a UUID.  Is the 
format:
 part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc
an established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?


Thanks,


Brian


Re: Definition Levels and Null

2019-05-13 Thread Brian Bowman
Thanks Wes,

We are using the Parquet C++ low-level APIs. 

Our Parquet "adapter" code will translate the SAS "missing" NaN representation 
to the correct position in the int16_t def level vector passed to the Parquet 
low-level writer.   Similarly, this adapter will reconstitute the NaN "missing" 
representation from the def level vector returned from LevelDecoder() at 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L77
 up through ReadBatch() and ultimately back to SAS.
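
For illustration, the write-side half of such an adapter boils down to something like the following sketch (illustrative only; it treats every NaN as "missing," whereas the real adapter would check the specific SAS missing-value NaN payloads):

#include <cmath>
#include <cstdint>
#include <vector>

// Split a double column that uses NaN as its "missing" sentinel into the
// (def_levels, dense values) pair expected by the low-level WriteBatch call.
void SplitMissing(const std::vector<double>& raw,
                  std::vector<int16_t>* def_levels,
                  std::vector<double>* dense) {
  def_levels->clear();
  dense->clear();
  for (double v : raw) {
    if (std::isnan(v)) {
      def_levels->push_back(0);   // 0 => NULL (max_definition_level == 1)
    } else {
      def_levels->push_back(1);   // 1 => value present
      dense->push_back(v);        // only non-null values are written
    }
  }
  // Pass def_levels->data(), nullptr, dense->data() to DoubleWriter::WriteBatch
  // with num_values == def_levels->size().
}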

-Brian

On 5/13/19, 2:48 PM, "Wes McKinney"  wrote:

EXTERNAL

To comment from the Parquet C++ side, we expose two writer APIs

* High level, using Apache Arrow -- use Arrow's bitmap-based
null/valid representation for null values, NaN is NaN
* Low level, produces your own repetition/definition levels

So if you're using the low level API, and you have values like

[1, 2, 3, NULL = NaN, 5]

then you could represent this as

def_levels = [1, 1, 1, 0, 1]
rep_levels = nullptr
values = [1, 2, 3, 5]

If you don't use the definition level encoding of nulls then other
readers will presume the values to be non-null.

On Mon, May 13, 2019 at 1:06 PM Tim Armstrong
 wrote:
>
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
> That seems right if your data doesn't have any complex types in it,
> max_def_level will always be 0 or 1 depending on whether the column is
> REQUIRED/OPTIONAL. One option, depending on your data model, is to always
> just mark the field as OPTIONAL and provide the def levels. If they're all
> 1 they will compress extremely well. Impala actually does this because
> mostly columns end up being potentially nullable in Impala/Hive data 
model.
>
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
> Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a 
quick
> look and I guess it does expose the concept of ref/def levels.
>
> > NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using 
def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
> If I was in your situation, this is what I'd probably do. We're seen a lot
> more inconsistency with handling of NaN between readers.
>
> On Mon, May 13, 2019 at 10:49 AM Brian Bowman  
wrote:
>
> > Tim,
> >
> > Thanks for your detailed reply and especially for pointing the RLE
> > encoding for the def level!
> >
> > Your comment:
> >
> > <<- If the field is required, the max def level is 0, therefore all
> > values
> >are 0, therefore the def levels can be "decoded" from nothing and
> > the def
> >levels can be omitted for the page.>>
> >
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> > for columns where def level of 0 indicates NULL and 1 means not NULL.  
The
> > SchemaDescriptor::BuildTree method at
> > 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> > shows how this causes max_def_level to increment.
> >
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> > not doing our own Parquet file-format writer/reader.
> >
> > NaNs representing missing values occur frequently in a myriad of SAS use
> > cases.  Other data types may be NULL as well, so I'm wondering if using 
def
> > level to indicate NULLs is safer (with consideration to other readers) 
and
> > also consumes less memory/storage across the spectrum of 
Parquet-supported
> > data types?
> >
> > Best,
> >
> > Brian
> >
> >
> > On 5/13/19, 1:03 PM, "Tim Armstrong" 
> > wrote:
> >
> > EXTERNAL
> >
> > Parquet float/double values can hold any IEEE floating point value -
> >
> > 
https://github.com/apache/parquet-format/blob/master/src/main

Re: Definition Levels and Null

2019-05-13 Thread Brian Bowman
Tim,

Thanks for your detailed reply and especially for pointing the RLE encoding for 
the def level!

Your comment: 

<<- If the field is required, the max def level is 0, therefore all values
   are 0, therefore the def levels can be "decoded" from nothing and the def
   levels can be omitted for the page.>>

I see that OPTIONAL or REPEATED must be specified as the Repetition type for 
columns where def level of 0 indicates NULL and 1 means not NULL.  The 
SchemaDescriptor::BuildTree method at 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
shows how this causes max_def_level to increment. 
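
As a small illustration (a sketch against the parquet-cpp schema API; the column names are made up), building one REQUIRED and one OPTIONAL flat column and inspecting the resulting max levels:

#include <parquet/schema.h>

#include <iostream>

int main() {
  using parquet::Repetition;
  using parquet::Type;
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("req_d", Repetition::REQUIRED, Type::DOUBLE));
  fields.push_back(PrimitiveNode::Make("opt_d", Repetition::OPTIONAL, Type::DOUBLE));
  auto root = GroupNode::Make("schema", Repetition::REQUIRED, fields);

  parquet::SchemaDescriptor descr;
  descr.Init(root);  // internally walks the tree (BuildTree)

  for (int i = 0; i < descr.num_columns(); ++i) {
    const parquet::ColumnDescriptor* col = descr.Column(i);
    std::cout << col->name()
              << " max_def=" << col->max_definition_level()
              << " max_rep=" << col->max_repetition_level() << "\n";
    // Expected: req_d max_def=0, opt_d max_def=1 (flat schema, no repetition).
  }
  return 0;
}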

We are using standard Parquet APIs via C++/libparquet.so and therefore not 
doing our own Parquet file-format writer/reader.
 
NaNs representing missing values occur frequently in a myriad of SAS use cases. 
 Other data types may be NULL as well, so I'm wondering if using def level to 
indicate NULLs is safer (with consideration to other readers) and also consumes 
less memory/storage across the spectrum of Parquet-supported data types?

Best,

Brian


On 5/13/19, 1:03 PM, "Tim Armstrong"  wrote:

EXTERNAL

Parquet float/double values can hold any IEEE floating point value -

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413.
So there's no reason you can't write NaN to the files. If a reader isn't
handling NaN values correctly, that seems like an issue with that reader,
although I think you're correct in that you're more likely to hit reader
bugs with NaN than NULL. (I may be telling you something you already know,
but thought I'd start with that).

I don't think the Parquet format is opinionated about what NULL vs NaN
means, although I'd assume that NULL means that the data simply wasn't
present, and NaN means that it was the result of a floating point
calculation that resulted in NaN.

The rep/definition level encoding is fairly complex because of the handling
of nested types and the various ways of encoding the sequence of levels.
The way I'd think about it is:

   - If you don't have any complex/nested types, rep levels aren't needed
   and the logical def levels degenerate into 1=not null, 0 = null.
   - The RLE encoding has a bit-width implied by the max def level value -
   if the max-level is 1, 1 bit is needed per value. If it is 0, 0 bits are
   needed per value.
   - If the field is required, the max def level is 0, therefore all values
   are 0, therefore the def levels can be "decoded" from nothing and the def
   levels can be omitted for the page.
   - If the field is nullable, the bit width is 1, therefore each def level
   is logically a bit. However, RLE encoding is applied to the sequence of 
1/0
   levels -
   https://github.com/apache/parquet-format/blob/master/Encodings.md

The last point is where I think your understanding might diverge from the
implementation - the encoded def levels are not simply a bit vector, it's a
more complex hybrid RLE/bit-packed encoding.

If you use one of the existing Parquet libraries it will handle all this
for you - it's a headache to get it all right from scratch.
- Tim


On Mon, May 13, 2019 at 8:43 AM Brian Bowman  wrote:

> All,
>
> I’m working to integrate the historic usage of SAS missing values for IEEE
> doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
> represent floating-point doubles that are “missing,” i.e. NULL in more
> general data management terms.
>
> Of course SAS’ goal is to create .parquet files that are universally
> readable.  Therefore, it appears that the SAS Parquet writer(s) will NOT 
be
> able to write the usual NAN to represent “missing,” because doing so will
> cause a floating point exception for other readers.
>
> Based on the Parquet doc at:
> https://parquet.apache.org/documentation/latest/ and by examining code, I
> understand that Parquet NULL values are indicated by setting 0x000 at the
> definition level vector offset corresponding to each NULL column offset
> value.
>
> Conversely, It appears that the per-column, per page definition level data
> is never written when required is not specified for the column schema.
>
> Is my understanding and Parquet terminology correct here?
>
> Thanks,
>
> Brian
>




Definition Levels and Null (resend)

2019-05-13 Thread Brian Bowman
All,

I’m working to integrate the historic usage of SAS missing values for IEEE 
doubles into our SAS Viya Parquet integration.  SAS writes a NAN to represent 
floating-point doubles that are “missing,” i.e. NULL in more general data 
management terms.

Of course SAS’ goal is to create .parquet files that are universally readable.  
Therefore, it appears that the SAS Parquet writer(s) will NOT be able to write 
the usual NAN to represent “missing,” because doing so will cause a floating 
point exception for other readers.

Based on the Parquet doc at:  https://parquet.apache.org/documentation/latest/ 
and by examining code, I understand that Parquet NULL values are indicated by 
setting 0x000 at the definition level vector offset corresponding to each NULL 
column offset value.

Conversely, It appears that the per-column, per page definition level data is 
never written when required is not specified for the column schema.

Is my understanding and Parquet terminology correct here?

Thanks,

Brian



Definition Levels and Null

2019-05-13 Thread Brian Bowman
All,

I’m working to integrate the historic usage of SAS missing values for IEEE 
doubles into our SAS Viya Parquet integration.  SAS writes a NAN to represent 
floating-point doubles that are “missing,” i.e. NULL in more general data 
management terms.

Of course SAS’ goal is to create .parquet files that are universally readable.  
Therefore, it appears that the SAS Parquet writer(s) will NOT be able to write 
the usual NAN to represent “missing,” because doing so will cause a floating 
point exception for other readers.

Based on the Parquet doc at:  https://parquet.apache.org/documentation/latest/ 
and by examining code, I understand that Parquet NULL values are indicated by 
setting 0x000 at the definition level vector offset corresponding to each NULL 
column offset value.

Conversely, It appears that the per-column, per page definition level data is 
never written when required is not specified for the column schema.

Is my understanding and Parquet terminology correct here?

Thanks,

Brian


Parquet vs. other Open Source Columnar Formats

2019-05-09 Thread Brian Bowman
All,

Is it fair to say that Parquet is fast becoming the dominant open source 
columnar storage format?   How do those of you with long-term Hadoop experience 
see this?  For example, is Parquet overtaking ORC and Avro?

Thanks,

Brian


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Brian Bowman
Hello Wes,

Thanks for the info!  I'm working to better understand Parquet/Arrow design and 
development processes.   No hurry for LARGE_BYTE_ARRAY.

-Brian


On 4/26/19, 11:14 AM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

I doubt that such a change could be made on a short time horizon.
Collecting feedback and building consensus (if it is even possible)
with stakeholders would take some time. The appropriate place to have
the discussion is here on the mailing list, though

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman  wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
resorting to other alternatives.  Is this something that could be done in 
Parquet over the next few months?  I have a lot of experience with file 
formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation
> of the Parquet format itself and not related to the C++
> implementation. It would be possibly interesting to add a
> LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
> doing much the same in Apache Arrow for in-memory)
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  
wrote:
> >
> > I don't think that's what you would want to do. Parquet will 
eventually
> > compress large values, but not after making defensive copies and 
attempting
> > to encode them. In the end, it will be a lot more overhead, plus 
the work
> > to make it possible. I think you'd be much better of compressing 
before
> > storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  
wrote:
> >
> > > My hope is that these large ByteArray values will encode/compress 
to a
> > > fraction of their original size.  FWIW, cpp/src/parquet/
> > > column_writer.cc/.h has int64_t offset and length fields all over 
the
> > > place.
> > >
> > > External file references to BLOBS is doable but not the elegant,
> > > integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> > >
> > > *EXTERNAL*
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. 
That's
> > > not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense 
to support
> > > byte arrays that are more than 2GB? That's far larger than the 
size of a
> > > row group, let alone a page. This would completely break memory 
management
> > > in the JVM implementation.
> > >
>     > > Can you solve this problem using a BLOB type that references an 
external
> > > file with the gigantic values? Seems to me that values this large 
should go
> > > in separate files, not in a Parquet file where it would destroy 
any benefit
> > > from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman 
 wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and 
the Thrift
> > >> message methods.  Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), 
ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* 
len, T*
> > >> deserialized_msg) {
> 

Re: Parquet Sync

2019-04-20 Thread Brian Bowman
Does the sync happen on Google Hangouts?  Could someone please provide a link for 
where to sign up/connect?

Thanks,

Brian

> On Apr 18, 2019, at 12:51 PM, Xinli shang  wrote:
> 
> EXTERNAL
> 
> Hi all,
> 
> Please send your agenda for the next Parquet community sync up meeting. I
> will compile and send the list before the meeting. One of the agenda I have
> so far is encryption.  The meeting will be tentatively at April 30 Tuesday
> 9-10am PT, just like our previous regular meeting time. Please let me know
> if you have any questions for agenda or date/time.
> 
> Xinli
> 
> On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem
>  wrote:
> 
>> It would be fine to have a rotation.
>> 
>> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
>> wrote:
>> 
>>> Hi,
>>> 
>>> I'd be happy to help. I have organized a few of these in the past, and
>> I've
>>> recently started similar meetings for the Impala project.
>>> 
>>> If someone else wants to do it, that's fine for me, too, of course.
>>> 
>>> Cheers, Lars
>>> 
>>> On Mon, Apr 15, 2019, 22:14 Julien Le Dem 
>> wrote:
>>> 
 Hello all,
 Since I have been away with the new baby the Parquet syncs have fallen
 behind.
 I'd like a volunteer to run those.
 Responsibilities include taking notes and posting them on the list.
 Also occasionally finding a good time for the meeting.
 Any takers? This could be a rotating duty as well.
 Thank you
 Julien
 
>>> 
>> 
> 
> 
> --
> Xinli Shang


Re: Parquet Sync

2019-04-16 Thread Brian Bowman
All,

I look forward to participating in the upcoming Parquet Syncs.  I'll be happy 
to be a "scribe in rotation" but would first like to participate in a couple of 
Syncs. 

By way of introduction:  I'm Brian Bowman, a 34+ year veteran of SAS R&D.  I've 
been working with Parquet Open Source and C++ for the past four months but have 
no prior open source experience.  My career has been programming in Assembly, 
C, Java and SAS, with decades of work in file format design, storage layer 
internals, and scalable distributed access control capabilities.  For the past 
5 years I've been doing core R&D for Cloud Analytic Services (CAS) -- the 
modern SAS distributed analytics and data management framework.  I work on the 
CAS distributed table, I/O, and indexing capabilities ... and now Parquet 
integration with CAS.
 
Arrow/Parquet are exciting technologies and I look forward to more work with 
this group as our efforts move ahead.

Best,

Brian

Brian Bowman
Principal Software Developer 
Analytic Server R&D
SAS Institute Inc.


On 4/16/19, 1:54 AM, "Julien Le Dem"  wrote:

EXTERNAL

It would be fine to have a rotation.

On Mon, Apr 15, 2019 at 10:44 PM Lars Volker 
wrote:

> Hi,
>
> I'd be happy to help. I have organized a few of these in the past, and 
I've
> recently started similar meetings for the Impala project.
>
> If someone else wants to do it, that's fine for me, too, of course.
>
> Cheers, Lars
>
> On Mon, Apr 15, 2019, 22:14 Julien Le Dem  wrote:
>
> > Hello all,
> > Since I have been away with the new baby the Parquet syncs have fallen
> > behind.
> > I'd like a volunteer to run those.
> > Responsibilities include taking notes and posting them on the list.
> > Also occasionally finding a good time for the meeting.
> > Any takers? This could be a rotating duty as well.
> > Thank you
> > Julien
> >
>




Re: Current Parquet Version

2019-04-10 Thread Brian Bowman
Fokko,

Thank you!  I'm not very experienced with GitHub yet and had looked in the 
wrong place.

Best,

Brian 

On 4/9/19, 10:38 PM, "Driesprong, Fokko"  wrote:

EXTERNAL

Hi Brian,

You could take a look at the Github of the Apache Parquet Format itself:
https://github.com/apache/parquet-format

Cheers, Fokko

Op ma 8 apr. 2019 om 20:19 schreef Brian Bowman :

> What is most current Apache Parquet file format version?  Where is this
> designated on the official Apache (or GitHub) site?
>
> Thanks,
>
>
> Brian
>




Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-08 Thread Brian Bowman
Hello Wes/all,

A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
resorting to other alternatives.  Is this something that could be done in 
Parquet over the next few months?  I have a lot of experience with file 
formats/storage layer internals and can contribute to Parquet C++.
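
For discussion purposes, a minimal sketch of what that could look like on the 
C++ side, mirroring the existing parquet::ByteArray in cpp/src/parquet/types.h 
but with a 64-bit length.  This is purely hypothetical -- neither the name 
LargeByteArray nor the layout is part of the Parquet format or parquet-cpp 
today:

#include <cstdint>

// Hypothetical sketch only -- not an existing parquet-cpp type.
struct LargeByteArray {
  LargeByteArray() : len(0), ptr(nullptr) {}
  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint64_t len;        // 64-bit length instead of uint32_t
  const uint8_t* ptr;  // caller-owned buffer, not copied
};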

-Brian

On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation
of the Parquet format itself and not related to the C++
implementation. It would be possibly interesting to add a
LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
doing much the same in Apache Arrow for in-memory)

- Wes

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but not after making defensive copies and 
attempting
> to encode them. In the end, it will be a lot more overhead, plus the work
> to make it possible. I think you'd be much better off compressing before
> storing in Parquet if you expect good compression rates.
>
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:
>
> > My hope is that these large ByteArray values will encode/compress to a
> > fraction of their original size.  FWIW, cpp/src/parquet/
> > column_writer.cc/.h has int64_t offset and length fields all over the
> > place.
> >
> > External file references to BLOBS is doable but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> >
> > *EXTERNAL*
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length. That's
> > not going to be a quick fix.
> >
> > Now that I'm thinking about this a bit more, does it make sense to 
support
> > byte arrays that are more than 2GB? That's far larger than the size of a
> > row group, let alone a page. This would completely break memory 
management
> > in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an external
> > file with the gigantic values? Seems to me that values this large 
should go
> > in separate files, not in a Parquet file where it would destroy any 
benefit
> > from using the format.
> >
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  
wrote:
> >
> >> Hello Ryan,
> >>
> >> Looks like it's limited by both the Parquet implementation and the 
Thrift
> >> message methods.  Am I missing anything?
> >>
> >> From cpp/src/parquet/types.h
> >>
> >> struct ByteArray {
> >>   ByteArray() : len(0), ptr(NULLPTR) {}
> >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> >>   uint32_t len;
> >>   const uint8_t* ptr;
> >> };
> >>
> >> From cpp/src/parquet/thrift.h
> >>
> >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> >> deserialized_msg) {
> >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
    > >> out)
> >>
> >> -Brian
> >>
> >> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
> >>
> >> EXTERNAL
> >>
> >> Hi Brian,
> >>
> >> This seems like something we should allow. What imposes the current
> >> limit?
> >> Is it in the thrift format, or just the implementations?
> >>
> >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
> >> wrote:
> >>
> >> > All,
> >> >
> >> > SAS requires support for storing varying-length character and
> >> binary blobs
> >> > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> >> field is
> >> > a uint32_t.   Looks like this will require incrementing the 
Parquet
> >> file
> >> > format version and changing ByteArray len to uint64_t.
> >> >
> >> > Have there been any requests for this or other Parquet 
developments
> >> that
> >> > require file format versioning changes?
> >> >
> >> > I realize this is a non-trivial ask.  Thanks for considering it.
> >> >
> >> > -Brian
> >> >
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix




Current Parquet Version

2019-04-08 Thread Brian Bowman
What is the most current Apache Parquet file format version?  Where is this 
designated on the official Apache (or GitHub) site?

Thanks,


Brian


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Thanks Ryan,

After further pondering this, I came to similar conclusions.

Compress the data before putting it into a Parquet ByteArray, and if that's not 
feasible, reference it in an external/persisted data structure.

Another alternative is to create one or more “shadow columns” to store the 
overflow horizontally.
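
For what it's worth, a rough sketch of the compress-before-store idea with zlib 
and the low-level C++ writer.  Illustrative only and untested; it assumes 
rg_writer is an open parquet::RowGroupWriter* positioned at an optional 
BYTE_ARRAY column (max definition level 1), and error handling is omitted:

#include <cstdint>
#include <vector>
#include <zlib.h>
#include <parquet/api/writer.h>

// Deflate the large value first, then store the compressed bytes as a single
// BYTE_ARRAY cell.  The reader has to inflate the bytes on the way back out.
void WriteCompressedBlob(parquet::RowGroupWriter* rg_writer,
                         const uint8_t* blob, size_t blob_len) {
  uLongf packed_len = compressBound(blob_len);
  std::vector<uint8_t> packed(packed_len);
  compress2(packed.data(), &packed_len, blob, blob_len, Z_BEST_SPEED);

  auto* writer =
      static_cast<parquet::ByteArrayWriter*>(rg_writer->NextColumn());
  int16_t def_level = 1;  // value is present
  // The cast below is exactly where the current uint32_t length limit bites.
  parquet::ByteArray value(static_cast<uint32_t>(packed_len), packed.data());
  writer->WriteBatch(1, &def_level, nullptr, &value);
}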

-Brian

On Apr 5, 2019, at 3:11 PM, Ryan Blue  wrote:


EXTERNAL

I don't think that's what you would want to do. Parquet will eventually 
compress large values, but not after making defensive copies and attempting to 
encode them. In the end, it will be a lot more overhead, plus the work to make 
it possible. I think you'd be much better off compressing before storing in 
Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS is doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS is doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h 

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) 
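
For what it's worth, this is where the 32-bit length surfaces for a caller 
today -- a small sketch (the helper name is made up) showing that anything at 
or above 2^32 bytes has to be rejected or split before a ByteArray can even be 
constructed:

#include <cstdint>
#include <limits>
#include <stdexcept>
#include <parquet/types.h>

// Guard required by the current format: ByteArray::len is uint32_t.
parquet::ByteArray MakeByteArray(const uint8_t* ptr, size_t len) {
  if (len > std::numeric_limits<uint32_t>::max()) {
    throw std::overflow_error("value too large for ByteArray::len (uint32_t)");
  }
  return parquet::ByteArray(static_cast<uint32_t>(len), ptr);
}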

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
All,

SAS requires support for storing varying-length character and binary blobs with 
a 2^64 max length in Parquet.   Currently, the ByteArray len field is a 
uint32_t.   Looks like this will require incrementing the Parquet file format 
version and changing ByteArray len to uint64_t.

Have there been any requests for this or other Parquet developments that 
require file format versioning changes?

I realize this is a non-trivial ask.  Thanks for considering it.

-Brian


Re: Passing File Descriptors in the Low-Level API

2019-03-16 Thread Brian Bowman
Thanks Wes!

I'm working on integrating and testing the necessary changes in our dev 
environment.  I'll submit a PR once things are working.
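
In the meantime, the descriptor-based open that Wes points to below can be 
exercised roughly as in this sketch.  The exact ReadableFile::Open overload and 
Status/Result style differ across Arrow versions, and the wrapper name here is 
made up:

#include <fcntl.h>
#include <memory>
#include <arrow/io/file.h>
#include <arrow/status.h>

arrow::Status OpenParquetFromFd(const char* path,
                                std::shared_ptr<arrow::io::ReadableFile>* out) {
  // The caller opens the descriptor ...
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return arrow::Status::IOError("open() failed");
  }
  // ... but with this API the ReadableFile takes ownership of fd and closes it
  // on Close()/destruction -- the behavior the planned changes need to relax
  // for the caller-owned mmap() case.
  return arrow::io::ReadableFile::Open(fd, out);
}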

Best,

Brian 

On 3/16/19, 4:24 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Please feel free to submit a PR to add the requisite APIs that you
need for your application. Antoine or I or others should be able to
give prompt feedback since we know this code pretty well.

Thanks
Wes

On Sat, Mar 16, 2019 at 11:40 AM Brian Bowman  wrote:
>
> Hi Wes,
>
> Thanks for the quick reply!  To be clear, the usage I'm working on needs 
to own both the Open FileDescriptor and corresponding mapped memory.  In other 
words ...
>
> SAS component does both open() and mmap() which could be for READ or 
WRITE.
>
> -> Calls low-level Parquet APIs to read an existing file or write a new 
one.  The open() and mmap() flags are guaranteed to be correct.
>
> At some later point SAS component does an unmap() and close().
>
> -Brian
>
>
> On 3/14/19, 3:42 PM, "Wes McKinney"  wrote:
>
> hi Brian,
>
> This is mostly an Arrow platform question so I'm copying the Arrow 
mailing list.
>
> You can open a file using an existing file descriptor using 
ReadableFile::Open
>
> 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145
>
> The documentation for this function says:
>
> "The file descriptor becomes owned by the ReadableFile, and will be
> closed on Close() or destruction."
>
> If you want to do the equivalent thing, but using memory mapping, I
> think you'll need to add a corresponding API to MemoryMappedFile. This
> is more perilous because of the API requirements of mmap -- you need
> to pass the right flags and they may need to be the same flags that
> were passed when opening the file descriptor, see
>
> 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378
>
> and
>
>     
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476
>
> - Wes
>
> On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman  
wrote:
> >
> >  The ReadableFile class (arrow/io/file.cc) has utility methods 
where a FileDescriptor is either passed in or returned, but I don’t see how 
this surfaces through the API.
> >
> > Is there a way for application code to control the open lifetime of 
mmap()’d Parquet files by passing an already open FileDescriptor to Parquet 
low-level API open/close methods?
> >
> > Thanks,
> >
> > Brian
> >
>
>
>




Re: Passing File Descriptors in the Low-Level API

2019-03-16 Thread Brian Bowman
Hi Wes,

Thanks for the quick reply!  To be clear, the usage I'm working on needs to own 
both the Open FileDescriptor and corresponding mapped memory.  In other words 
...

SAS component does both open() and mmap() which could be for READ or WRITE.

-> Calls low-level Parquet APIs to read an existing file or write a new one.  
The open() and mmap() flags are guaranteed to be correct.

At some later point SAS component does an unmap() and close(). 
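
In code form, the intended lifetime reads roughly like the sketch below.  The 
Parquet call in the middle is a placeholder for whatever low-level API ends up 
accepting the caller-owned descriptor/mapping; error handling is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void ReadViaCallerOwnedMapping(const char* path) {
  // SAS component owns both the descriptor and the mapping.
  int fd = open(path, O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

  // ... call the low-level Parquet reader against (base, st.st_size) here;
  // exposing such an entry point without transferring ownership of fd or the
  // mapping is the subject of this thread ...

  // SAS component tears everything down later.
  munmap(base, st.st_size);
  close(fd);
}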

-Brian


On 3/14/19, 3:42 PM, "Wes McKinney"  wrote:

hi Brian,

This is mostly an Arrow platform question so I'm copying the Arrow mailing 
list.

You can open a file using an existing file descriptor using 
ReadableFile::Open

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145

The documentation for this function says:

"The file descriptor becomes owned by the ReadableFile, and will be
closed on Close() or destruction."

If you want to do the equivalent thing, but using memory mapping, I
think you'll need to add a corresponding API to MemoryMappedFile. This
is more perilous because of the API requirements of mmap -- you need
to pass the right flags and they may need to be the same flags that
were passed when opening the file descriptor, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378

and

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476

- Wes

On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman  wrote:
>
>  The ReadableFile class (arrow/io/file.cc) has utility methods where a 
FileDescriptor is either passed in or returned, but I don’t see how this 
surfaces through the API.
>
> Is there a way for application code to control the open lifetime of 
mmap()’d Parquet files by passing an already open FileDescriptor to Parquet 
low-level API open/close methods?
>
> Thanks,
>
> Brian
>





Passing File Descriptors in the Low-Level API

2019-03-14 Thread Brian Bowman
 The ReadableFile class (arrow/io/file.cc) has utility methods where a 
FileDescriptor is either passed in or returned, but I don’t see how this 
surfaces through the API.

Is there a way for application code to control the open lifetime of mmap()’d 
Parquet files by passing an already open FileDescriptor to Parquet low-level 
API open/close methods?

Thanks,

Brian