Re: Java API that matches C++ low-level write_batch
To clarify … is there a public Parquet Java API that includes writing Definition Levels like the C++ low-level API provides? https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ColumnWriter.java appears to be an internal API. Are Definition Levels exposed in a public API, and are there Java examples that use Definition Levels? We have a Java use case that needs to write Definition Levels. Thanks, Brian

From: Brian Bowman Date: Wednesday, May 6, 2020 at 9:52 AM To: "dev@parquet.apache.org" Cc: Karl Moss , Paul Tomas Subject: Java API that matches C++ low-level write_batch

Here’s some Parquet low-level API C++ code used to write a batch of IEEE doubles in a RowGroup. Is there a public Java API equivalent for writing Parquet files?

// Append a RowGroup with a specific number of rows.
parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();

auto* double_writer =
    static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());

double_writer->WriteBatch(num_rows, defLevels, nullptr, (double*)double_rows);

Thanks, Brian
Java API that matches C++ low-level write_batch
Here’s some Parquet low-level API C++ code used to write a batch of IEEE doubles in a RowGroup. Is there a public Java API equivalent for writing Parquet files?

// Append a RowGroup with a specific number of rows.
parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();

auto* double_writer =
    static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());

double_writer->WriteBatch(num_rows, defLevels, nullptr, (double*)double_rows);

Thanks, Brian
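For context, a minimal sketch of the full low-level parquet-cpp write path the snippet above is taken from. It assumes a single OPTIONAL DOUBLE column (so max_definition_level == 1); the schema helper, sink parameter, and function names are illustrative additions, not from the original message:

#include <memory>
#include <arrow/io/api.h>
#include <parquet/api/writer.h>

using parquet::schema::GroupNode;
using parquet::schema::PrimitiveNode;

// One OPTIONAL DOUBLE column => max_definition_level == 1
// (def level 0 = NULL, 1 = value present).
std::shared_ptr<GroupNode> MakeSchema() {
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make(
      "dbl_col", parquet::Repetition::OPTIONAL, parquet::Type::DOUBLE));
  return std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
}

void WriteDoubles(std::shared_ptr<arrow::io::OutputStream> sink,
                  int64_t num_rows, const int16_t* defLevels,
                  const double* double_rows) {
  std::unique_ptr<parquet::ParquetFileWriter> file_writer =
      parquet::ParquetFileWriter::Open(sink, MakeSchema());
  parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();
  auto* double_writer =
      static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());
  // rep_levels may be nullptr for a flat (non-nested) schema; double_rows
  // holds only the non-null values, densely packed.
  double_writer->WriteBatch(num_rows, defLevels, nullptr, double_rows);
  file_writer->Close();
}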
Re: Java/Go APIs - Writing Definition Level Info
Forgive the interruption. I found the Java and Go APIs/code. From: Brian Bowman Sent: Saturday, April 11, 2020 4:33:32 PM To: dev@parquet.apache.org Cc: Karl Moss ; Jason Secosky Subject: Java/Go APIs - Writing Definition Level Info Can someone familiar with the Java/Go Parquet code please reply with API code links/doc for writing/reading column data and corresponding definition levels. We are doing this with the Parquet C++ low-level APIs as a means to represent NULL values (max_definition_level = 1 where 0 is NULL/1 is a value) and exploring the same possibility with Java and Go. Thanks, Brian
Java/Go APIs - Writing Definition Level Info
Can someone familiar with the Java/Go Parquet code please reply with API code links/doc for writing/reading column data and corresponding definition levels. We are doing this with the Parquet C++ low-level APIs as a means to represent NULL values (max_definition_level = 1 where 0 is NULL/1 is a value) and exploring the same possibility with Java and Go. Thanks, Brian
Re: Dictionary Decoding for BYTE_ARRAY types
Thanks Wes, I'm getting the per-Row Group MAX/MIN BYTE_ARRAY values back. Are the maximum lengths for each BYTE_ARRAY column also stored? For example, take the following BYTE_ARRAY column with three "Canadian Province" values: MIN = "Alberta", MAX = "Saskatchewan". "British Columbia" is the longest value (16) though it's not a MIN/MAX value. Is this maximum length (e.g. 16 in this example) of each BYTE_ARRAY column value stored in any Parquet column-scoped metadata? Thanks, Brian On 9/12/19, 6:10 PM, "Wes McKinney" wrote: See https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120 On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman wrote: > > Thanks Wes, > > With that in mind, I’m searching for a public API that returns the MAX length value for ByteArray columns. Can you point me to an example? > > -Brian > > On 9/12/19, 5:34 PM, "Wes McKinney" wrote: > > The memory references returned by ReadBatch are not guaranteed to > persist from one function call to the next. So you need to copy the > ByteArray data into your own data structures before calling ReadBatch > again. > > Column readers for different columns are independent from each other. > So function calls for column 7 should not affect anything having to do > with column 4. > > On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman wrote: > > > > All, > > > > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types. > > > > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY. > > > > In the following ReadBatch(), rowsToRead is already set to all rows in the Row Group. The quantity is verified by the return value in values_read. > > > > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read); > > > > Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string – although not from the original dictionary values in the .parquet file being read. > > > > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed. > > > > Is this expected behavior or a bug? If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again. Is this the case? > > > > Thanks for clarifying, > > > > -Brian
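A hedged sketch of reading the per-row-group min/max statistics via the metadata.h interface linked above, assuming a recent Arrow-era parquet-cpp; the function and variable names are illustrative. Note that column-chunk statistics carry the min/max values themselves, so the maximum byte length of a BYTE_ARRAY column (the "British Columbia" case above) is not recoverable from column-scoped metadata alone:

#include <iostream>
#include <memory>
#include <string>
#include <parquet/api/reader.h>

// Print the encoded min/max statistics for one column in every row group.
void PrintMinMax(const std::string& path, int column_index) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  for (int rg = 0; rg < md->num_row_groups(); ++rg) {
    auto rg_md = md->RowGroup(rg);
    auto col_md = rg_md->ColumnChunk(column_index);
    std::shared_ptr<parquet::Statistics> stats = col_md->statistics();
    if (stats && stats->HasMinMax()) {
      // EncodeMin()/EncodeMax() return the bytes stored in the file
      // metadata; for BYTE_ARRAY these are the raw min/max strings.
      std::cout << "row group " << rg << " min=" << stats->EncodeMin()
                << " max=" << stats->EncodeMax() << "\n";
    }
  }
}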
Re: Dictionary Decoding for BYTE_ARRAY types
Thanks Wes, With that in mind, I’m searching for a public API that returns the MAX length value for ByteArray columns. Can you point me to an example? -Brian On 9/12/19, 5:34 PM, "Wes McKinney" wrote: The memory references returned by ReadBatch are not guaranteed to persist from one function call to the next. So you need to copy the ByteArray data into your own data structures before calling ReadBatch again. Column readers for different columns are independent from each other. So function calls for column 7 should not affect anything having to do with column 4. On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman wrote: > > All, > > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types. > > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY. > > In the following ReadBatch(), rowsToRead is already set to all rows in the Row Group. The quantity is verified by the return value in values_read. > > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read); > > Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string – although not from the original dictionary values in the .parquet file being read. > > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed. > > Is this expected behavior or a bug? If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again. Is this the case? > > Thanks for clarifying, > > -Brian
Dictionary Decoding for BYTE_ARRAY types
All, I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types. Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY. In the following ReadBatch(), rowsToRead is already set to all rows in the Row Group. The quantity is verified by the return value in values_read. byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read); Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string – although not from the original dictionary values in the .parquet file being read. As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed. Is this expected behavior or a bug? If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again. Is this the case? Thanks for clarifying, -Brian
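A minimal sketch of the copy-out step Wes suggests in the replies above: the ByteArray descriptors returned by ReadBatch point into reader-owned buffers (e.g. a decoded dictionary page) that may be recycled by the next ReadBatch call, so each len/ptr pair must be copied into owned storage first. The call mirrors the one in the thread (nullptr def/rep levels); the function shape is illustrative:

#include <cstdint>
#include <string>
#include <vector>
#include <parquet/api/reader.h>

// Read one batch and deep-copy the ByteArray payloads so they stay valid
// after subsequent ReadBatch calls on this or other columns.
std::vector<std::string> CopyByteArrayBatch(
    parquet::ByteArrayReader* byte_array_reader, int64_t rows_to_read) {
  std::vector<parquet::ByteArray> descs(rows_to_read);
  int64_t values_read = 0;
  byte_array_reader->ReadBatch(rows_to_read, /*def_levels=*/nullptr,
                               /*rep_levels=*/nullptr, descs.data(),
                               &values_read);
  std::vector<std::string> owned;
  owned.reserve(values_read);
  for (int64_t i = 0; i < values_read; ++i) {
    // Copy the bytes the descriptor points at into owned storage.
    owned.emplace_back(reinterpret_cast<const char*>(descs[i].ptr),
                       descs[i].len);
  }
  return owned;  // Safe to use after the reader buffers are recycled.
}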
Re: Workaround for Thrift download ERRORs
This turns out to be a problem with cert validation when cmake sets up to download thrift, due to back-level RHEL 6 and Python 2.6.6 on the internal SAS build node where this fails. We need to update to a newer supported version of Python on these machines. https://issues.apache.org/jira/ seems to be down right now. I'll add a comment about this when it comes back. -Brian On 7/15/19, 3:45 PM, "Brian Bowman" wrote: Wes, I can reproduce this issue with a post-0.14.0 git-cloned repo on the same linux system where I see it with the 0.14.0 release download. The successful cmake/make occurs on a Linux VM where I personally staged all of the build tools, built gcc, cmake 3.2+ etc. This looks like a problem with my build environments here at SAS and NOT with the apache arrow/parquet code base. Just updated https://issues.apache.org/jira/browse/ARROW-5953 with this info. I'll keep this JIRA open for now in case there is something new to report that can help other novices (like me) in building arrow/parquet. Forgive the noise. -Brian On 7/15/19, 2:54 PM, "Wes McKinney" wrote: Sorry, that's not right. That commit was included in 0.14.0. So I'm not sure what code change could cause what you're seeing On Mon, Jul 15, 2019 at 1:53 PM Wes McKinney wrote: > > It might have gotten fixed in > > https://github.com/apache/arrow/commit/3a37bf29c512b4c72c8da5b2a8657b21548cc47a#diff-d7849d7fb46f0cd405cfe5fd03828fcd > > On Mon, Jul 15, 2019 at 1:50 PM Brian Bowman wrote: > > > > See: https://issues.apache.org/jira/browse/ARROW-5953 > > > > I'll keep digging. > > > > -Brian > > > > On 7/15/19, 2:45 PM, "Wes McKinney" wrote: > > > > Brian -- I am concerned the issue is non-deterministic and relates to > > the get_apache_mirror.py script. There may be something we can do to > > make that script more robust, e.g. adding retry logic or some fallback > > to a known mirror. Of course, if you can consistently reproduce the > > issue that is good to know too > > > > - Wes > > > > On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman wrote: > > > > > > Wes, > > > > > > Here are the cmake thrift log lines from a build of an apache-arrow git clone on 06Jul2019 where cmake successfully downloads thrift. > > > > > > -- Checking for module 'thrift' > > > -- No package 'thrift' found > > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) > > > Building Apache Thrift from source > > > Downloading Apache Thrift from http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz > > > > > > Do you still want a JIRA issue entered, given that this git clone works and is a bit newer than the arrow-0.14.0 release .tar? > > > > > > - Brian > > > > > > > > > On 7/15/19, 12:39 PM, "Wes McKinney" wrote: > > > > > > hi Brian, > > > > > > Can you please open a JIRA issue? > > > > > > Does running the "get_apache_mirror.py" script work for you by itself? > > > > > > $ python cpp/build-support/get_apache_mirror.py > > > https://www-eu.apache.org/dist/ > > > > > > - Wes > > > > > > On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman wrote: > > > > > > > > Is there a workaround for the following error? > > > > > > > > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > > > > > > > I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being found during cmake. This results in downstream compile errors during make.
Re: Workaround for Thrift download ERRORs
Wes, I can reproduce this issue with a post-0.14.0 git-cloned repo on the same linux system where I see it with the 0.14.0 release download. The successful cmake/make occurs on a Linux VM where I personally staged all of the build tools, built gcc, cmake 3.2+ etc. This looks like a problem with my build environments here at SAS and NOT with the apache arrow/parquet code base. Just updated https://issues.apache.org/jira/browse/ARROW-5953 with this info. I'll keep this JIRA open for now in case there is something new to report that can help other novices (like me) in building arrow/parquet. Forgive the noise. -Brian On 7/15/19, 2:54 PM, "Wes McKinney" wrote: Sorry, that's not right. That commit was included in 0.14.0. So I'm not sure what code change could cause what you're seeing On Mon, Jul 15, 2019 at 1:53 PM Wes McKinney wrote: > > It might have gotten fixed in > > https://github.com/apache/arrow/commit/3a37bf29c512b4c72c8da5b2a8657b21548cc47a#diff-d7849d7fb46f0cd405cfe5fd03828fcd > > On Mon, Jul 15, 2019 at 1:50 PM Brian Bowman wrote: > > > > See: https://issues.apache.org/jira/browse/ARROW-5953 > > > > I'll keep digging. > > > > -Brian > > > > On 7/15/19, 2:45 PM, "Wes McKinney" wrote: > > > > Brian -- I am concerned the issue is non-deterministic and relates to > > the get_apache_mirror.py script. There may be something we can do to > > make that script more robust, e.g. adding retry logic or some fallback > > to a known mirror. Of course, if you can consistently reproduce the > > issue that is good to know too > > > > - Wes > > > > On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman wrote: > > > > > > Wes, > > > > > > Here are the cmake thrift log lines from a build of an apache-arrow git clone on 06Jul2019 where cmake successfully downloads thrift. > > > > > > -- Checking for module 'thrift' > > > -- No package 'thrift' found > > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) > > > Building Apache Thrift from source > > > Downloading Apache Thrift from http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz > > > > > > Do you still want a JIRA issue entered, given that this git clone works and is a bit newer than the arrow-0.14.0 release .tar? > > > > > > - Brian > > > > > > > > > On 7/15/19, 12:39 PM, "Wes McKinney" wrote: > > > > > > hi Brian, > > > > > > Can you please open a JIRA issue? > > > > > > Does running the "get_apache_mirror.py" script work for you by itself? > > > > > > $ python cpp/build-support/get_apache_mirror.py > > > https://www-eu.apache.org/dist/ > > > > > > - Wes > > > > > > On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman wrote: > > > > > > > > Is there a workaround for the following error? > > > > > > > > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > > > > > > > I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being found during cmake. This results in downstream compile errors during make. > > > > > > > > Here’s the log info from cmake: > > > > > > > > -- Checking for module 'thrift' > > > > -- No package 'thrift' found > > > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR THRIFT_COMPILER) > > > > Building Apache Thrift from source > > > > Downloading Apache Thrift from Traceback (most recent call last): > > > > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in > > > > suggested_mirror = get_url('https://www.apache.org/dyn/' > > > > File "…/apache-arrow-0.14
Re: Workaround for Thrift download ERRORs
See: https://issues.apache.org/jira/browse/ARROW-5953 I'll keep digging. -Brian On 7/15/19, 2:45 PM, "Wes McKinney" wrote: Brian -- I am concerned the issue is non-deterministic and relates to the get_apache_mirror.py script. There may be something we can do to make that script more robust, e.g. adding retry logic or some fallback to a known mirror. Of course, if you can consistently reproduce the issue that is good to know too - Wes On Mon, Jul 15, 2019 at 1:31 PM Brian Bowman wrote: > > Wes, > > Here are the cmake thrift log lines from a build of an apache-arrow git clone on 06Jul2019 where cmake successfully downloads thrift. > > -- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) > Building Apache Thrift from source > Downloading Apache Thrift from http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz > > Do you still want a JIRA issue entered, given that this git clone works and is a bit newer than the arrow-0.14.0 release .tar? > > - Brian > > > On 7/15/19, 12:39 PM, "Wes McKinney" wrote: > > hi Brian, > > Can you please open a JIRA issue? > > Does running the "get_apache_mirror.py" script work for you by itself? > > $ python cpp/build-support/get_apache_mirror.py > https://www-eu.apache.org/dist/ > > - Wes > > On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman wrote: > > > > Is there a workaround for the following error? > > > > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > > > I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being found during cmake. This results in downstream compile errors during make. > > > > Here’s the log info from cmake: > > > > -- Checking for module 'thrift' > > -- No package 'thrift' found > > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR THRIFT_COMPILER) > > Building Apache Thrift from source > > Downloading Apache Thrift from Traceback (most recent call last): > > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in > > suggested_mirror = get_url('https://www.apache.org/dyn/' > > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, in get_url > > return requests.get(url).content > > File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get > > return request('get', url, **kwargs) > > File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request > > response = session.request(method=method, url=url, **kwargs) > > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in request > > resp = self.send(prep, **send_kwargs) > > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in send > > r = adapter.send(request, **kwargs) > > File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in send > > raise SSLError(e, request=request) > > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > > > Thanks, > > > > Brian
Re: Workaround for Thrift download ERRORs
Wes, Here are the cmake thrift log lines from a build of an apache-arrow git clone on 06Jul2019 where cmake successfully downloads thrift. -- Checking for module 'thrift' -- No package 'thrift' found -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) Building Apache Thrift from source Downloading Apache Thrift from http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz Do you still want a JIRA issue entered, given that this git clone works and is a bit newer than the arrow-0.14.0 release .tar? - Brian On 7/15/19, 12:39 PM, "Wes McKinney" wrote: hi Brian, Can you please open a JIRA issue? Does running the "get_apache_mirror.py" script work for you by itself? $ python cpp/build-support/get_apache_mirror.py https://www-eu.apache.org/dist/ - Wes On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman wrote: > > Is there a workaround for the following error? > > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being found during cmake. This results in downstream compile errors during make. > > Here’s the log info from cmake: > > -- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR THRIFT_COMPILER) > Building Apache Thrift from source > Downloading Apache Thrift from Traceback (most recent call last): > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in > suggested_mirror = get_url('https://www.apache.org/dyn/' > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, in get_url > return requests.get(url).content > File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get > return request('get', url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request > response = session.request(method=method, url=url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in request > resp = self.send(prep, **send_kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in send > r = adapter.send(request, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in send > raise SSLError(e, request=request) > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > > > Thanks, > > > Brian
Workaround for Thrift download ERRORs
Is there a workaround for the following error?

requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz

I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being found during cmake. This results in downstream compile errors during make. Here’s the log info from cmake:

-- Checking for module 'thrift'
-- No package 'thrift' found
-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR THRIFT_COMPILER)
Building Apache Thrift from source
Downloading Apache Thrift from Traceback (most recent call last):
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, in
    suggested_mirror = get_url('https://www.apache.org/dyn/'
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, in get_url
    return requests.get(url).content
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz

Thanks, Brian
Need Help With Runtime ERROR in parquet::ColumnReader [C++]
Have any of you seen the following? It occurs intermittently in my test cases. FWIW - this is the only thread currently invoking Parquet code.

#0 std::move&> (__t=...) at /usr/local/include/c++/7.3.0/bits/move.h:98
#1 0x7f94c86eb8bc in std::__shared_ptr::operator=(std::__shared_ptr&&) (this=0x7f94a0007968, __r=) at /usr/local/include/c++/7.3.0/bits/shared_ptr_base.h:1213
#2 0x7f94c86e806a in std::shared_ptr::operator=(std::shared_ptr&&) (this=0x7f94a0007968, __r=) at /usr/local/include/c++/7.3.0/bits/shared_ptr.h:319
#3 0x7f94c8747af6 in parquet::TypedColumnReader >::ReadNewPage (this=0x7f94a0007950) at . . . /Arrow/cpp/src/parquet/column_reader.cc:324
#4 0x7f94c873e04f in parquet::ColumnReader::HasNext (this=0x7f94a0007950) at . . . /Arrow/cpp/src/parquet/column_reader.h:121
#5 0x7f94c87472df in parquet::TypedColumnReader >::ReadBatch (this=0x7f94a0007950, batch_size=5, def_levels=0x7f94a52535f8, rep_levels=0x0, values=0x7f94a5253020, values_read=0x7f94afb419d8) at . . . /Arrow/cpp/src/parquet/column_reader.h:347

Thanks, Brian
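For readers following along, a sketch of where this call path sits: ReadBatch() calls HasNext(), which pulls the next data page via ReadNewPage(), matching frames #3-#5 above. The loop shape, column choice, and batch size are illustrative assumptions, not taken from the failing test:

#include <cstdint>
#include <memory>
#include <vector>
#include <parquet/api/reader.h>

// Typical batch-read loop over one DOUBLE column of a row group.
void ReadDoubleColumn(parquet::RowGroupReader* rg_reader, int col,
                      int64_t batch_size) {
  auto reader = std::static_pointer_cast<parquet::DoubleReader>(
      rg_reader->Column(col));
  std::vector<double> values(batch_size);
  std::vector<int16_t> def_levels(batch_size);
  int64_t values_read = 0;
  // HasNext() is what triggers ReadNewPage() when the current page
  // is exhausted.
  while (reader->HasNext()) {
    reader->ReadBatch(batch_size, def_levels.data(), /*rep_levels=*/nullptr,
                      values.data(), &values_read);
    // ... consume values_read doubles ...
  }
}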
Re: Arrow/Parquet make failing in the Thrift EP
Removing Thrift 0.11 solved the problem. Thanks Wes! -Brian On 5/30/19, 4:47 PM, "Wes McKinney" wrote: OK, it looks like you have Thrift 0.11 installed somewhere on your system which is wreaking havoc: cat ../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log make[5]: *** No rule to make target `/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'. Stop. Our thirdparty build should be building Thrift 0.12 https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt#L47 On Thu, May 30, 2019 at 3:40 PM Brian Bowman wrote: > > Thanks Wes, > > openssl-devel is installed. > > yum info openssl-devel > Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager > Installed Packages > Name: openssl-devel > Arch: x86_64 > Epoch : 1 > Version : 1.0.2k > Release : 8.el7 > Size: 3.1 M > Repo: installed > From repo : anaconda > . . . > > Available Packages > Name: openssl-devel > Arch: i686 > Epoch : 1 > Version : 1.0.2k > Release : 8.el7 > Size: 1.5 M > Repo: RHEL74 > . . . > > I also downloaded and successfully built Thrift. > > ll /usr/local/lib/libth* > -rwxr-xr-x 1 root root 8394200 May 30 14:03 /usr/local/lib/libthrift-0.12.0.so > -rw-r--r-- 1 root root 20996002 May 30 14:03 /usr/local/lib/libthrift.a > -rw-r--r-- 1 root root 1224764 May 30 14:05 /usr/local/lib/libthrift_c_glib.a > -rwxr-xr-x 1 root root 1049 May 30 14:05 /usr/local/lib/libthrift_c_glib.la > lrwxrwxrwx 1 root root 25 May 30 14:05 /usr/local/lib/libthrift_c_glib.so -> libthrift_c_glib.so.0.0.0 > lrwxrwxrwx 1 root root 25 May 30 14:05 /usr/local/lib/libthrift_c_glib.so.0 -> libthrift_c_glib.so.0.0.0 > -rwxr-xr-x 1 root root 669136 May 30 14:05 /usr/local/lib/libthrift_c_glib.so.0.0.0 > -rwxr-xr-x 1 root root 999 May 30 14:03 /usr/local/lib/libthrift.la > -rwxr-xr-x 1 root root 716944 May 30 14:03 /usr/local/lib/libthriftqt-0.12.0.so > -rw-r--r-- 1 root root 1580088 May 30 14:03 /usr/local/lib/libthriftqt.a > -rwxr-xr-x 1 root root 1019 May 30 14:03 /usr/local/lib/libthriftqt.la > lrwxrwxrwx 1 root root 21 May 30 14:03 /usr/local/lib/libthriftqt.so -> libthriftqt-0.12.0.so > lrwxrwxrwx 1 root root 19 May 30 14:03 /usr/local/lib/libthrift.so -> libthrift-0.12.0.so > -rwxr-xr-x 1 root root 1175424 May 30 14:03 /usr/local/lib/libthriftz-0.12.0.so > -rw-r--r-- 1 root root 2631910 May 30 14:03 /usr/local/lib/libthriftz.a > -rwxr-xr-x 1 root root 995 May 30 14:03 /usr/local/lib/libthriftz.la > lrwxrwxrwx 1 root root 20 May 30 14:03 /usr/local/lib/libthriftz.so -> libthriftz-0.12.0.so > > I'll keep digging. > > -Brian > > On 5/30/19, 4:32 PM, "Wes McKinney" wrote: > > hi Brian, > > Is openssl-devel installed on this system? We don't have any > OpenSSL-specific code for the Thrift EP build in the Arrow build > system so you might try to see if you can build Thrift directly from > source on the system to see if the problem persists > > It appears that Thrift tries to build against OpenSSL if CMake can > detect it on the system > > https://github.com/apache/thrift/blob/master/build/cmake/DefineOptions.cmake#L76 > > - Wes > > On Thu, May 30, 2019 at 2:41 PM Brian Bowman wrote: > > > > Just started seeing the following ERROR when compiling Arrow/Parquet cpp after restaging my dev environment in RHEL 7.4. > > > > Is something incorrect in my CMake setup such that the installed libssl.so required by Thrift cannot be found? > > > > cat ../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log > > make[5]: *** No rule to make target `/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'. Stop. > > > > Any ideas what’s going on? > > > > -Brian
Re: Arrow/Parquet make failing in the Thrift EP
Thanks Wes, openssl-devel is installed. yum info openssl-devel Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager Installed Packages Name: openssl-devel Arch: x86_64 Epoch : 1 Version : 1.0.2k Release : 8.el7 Size: 3.1 M Repo: installed From repo : anaconda . . . Available Packages Name: openssl-devel Arch: i686 Epoch : 1 Version : 1.0.2k Release : 8.el7 Size: 1.5 M Repo: RHEL74 . . . I also downloaded and successfully built Thrift. ll /usr/local/lib/libth* -rwxr-xr-x 1 root root 8394200 May 30 14:03 /usr/local/lib/libthrift-0.12.0.so -rw-r--r-- 1 root root 20996002 May 30 14:03 /usr/local/lib/libthrift.a -rw-r--r-- 1 root root 1224764 May 30 14:05 /usr/local/lib/libthrift_c_glib.a -rwxr-xr-x 1 root root 1049 May 30 14:05 /usr/local/lib/libthrift_c_glib.la lrwxrwxrwx 1 root root 25 May 30 14:05 /usr/local/lib/libthrift_c_glib.so -> libthrift_c_glib.so.0.0.0 lrwxrwxrwx 1 root root 25 May 30 14:05 /usr/local/lib/libthrift_c_glib.so.0 -> libthrift_c_glib.so.0.0.0 -rwxr-xr-x 1 root root 669136 May 30 14:05 /usr/local/lib/libthrift_c_glib.so.0.0.0 -rwxr-xr-x 1 root root 999 May 30 14:03 /usr/local/lib/libthrift.la -rwxr-xr-x 1 root root 716944 May 30 14:03 /usr/local/lib/libthriftqt-0.12.0.so -rw-r--r-- 1 root root 1580088 May 30 14:03 /usr/local/lib/libthriftqt.a -rwxr-xr-x 1 root root 1019 May 30 14:03 /usr/local/lib/libthriftqt.la lrwxrwxrwx 1 root root 21 May 30 14:03 /usr/local/lib/libthriftqt.so -> libthriftqt-0.12.0.so lrwxrwxrwx 1 root root 19 May 30 14:03 /usr/local/lib/libthrift.so -> libthrift-0.12.0.so -rwxr-xr-x 1 root root 1175424 May 30 14:03 /usr/local/lib/libthriftz-0.12.0.so -rw-r--r-- 1 root root 2631910 May 30 14:03 /usr/local/lib/libthriftz.a -rwxr-xr-x 1 root root 995 May 30 14:03 /usr/local/lib/libthriftz.la lrwxrwxrwx 1 root root 20 May 30 14:03 /usr/local/lib/libthriftz.so -> libthriftz-0.12.0.so I'll keep digging. -Brian On 5/30/19, 4:32 PM, "Wes McKinney" wrote: hi Brian, Is openssl-devel installed on this system? We don't have any OpenSSL-specific code for the Thrift EP build in the Arrow build system so you might try to see if you can build Thrift directly from source on the system to see if the problem persists It appears that Thrift tries to build against OpenSSL if CMake can detect it on the system https://github.com/apache/thrift/blob/master/build/cmake/DefineOptions.cmake#L76 - Wes On Thu, May 30, 2019 at 2:41 PM Brian Bowman wrote: > > Just started seeing the following ERROR when compiling Arrow/Parquet cpp after restaging my dev environment in RHEL 7.4. > > Is something incorrect in my CMake setup such that the installed libssl.so required by Thrift cannot be found? > > cat ../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log > make[5]: *** No rule to make target `/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'. Stop. > > Any ideas what’s going on? > > -Brian
Arrow/Parquet make failing in the Thrift EP
Just started seeing the following ERROR when compiling Arrow/Parquet cpp after restaging my dev environment in RHEL 7.4. Is something incorrect in my CMake setup such that the installed libssl.so required by Thrift cannot be found?

cat ../Arrow/cpp/cmake-build-debug/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-build-err.log
make[5]: *** No rule to make target `/usr/lib/x86_64-linux-gnu/libssl.so', needed by `lib/libthriftd.so.0.11.0'. Stop.

Any ideas what’s going on? -Brian
Re: Parquet File Naming Convention Standards
Thanks for the info! HDFS is only one of many storage platforms (distributed or otherwise) that SAS supports. In general larger physical files (e.g. 100MB to 1GB) with multiple RowGroups are also a good thing for our usage cases. I'm working to get our Parquet (C to C++ via libparquet.so) writer to do this. -Brian On 5/22/19, 1:21 PM, "Lee, David" wrote: I'm not a big fan of this convention, which is a Spark convention. A. The files should have at least "foo" in the name. Using PyArrow I would create these files as foo.1.parquet, foo.2.parquet, etc. B. These files are around 3 megs each. For HDFS storage, files should be sized to match the HDFS blocksize which is usually set at 128 megs (default) or 256 megs, 512 megs, 1 gig, etc. https://blog.cloudera.com/blog/2009/02/the-small-files-problem/ I usually take small parquet files and save them as parquet row groups in a larger parquet file to match the HDFS blocksize. -----Original Message----- From: Brian Bowman Sent: Wednesday, May 22, 2019 8:40 AM To: dev@parquet.apache.org Subject: Parquet File Naming Convention Standards All, Here is an example .parquet data set saved using pySpark where the following files are members of directory “foo.parquet”:

-rw-r--r-- 1 sasbpb r 8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r-- 1 sasbpb r 25632 Mar 26 12:10 .part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 25356 Mar 26 12:10 .part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 26300 Mar 26 12:10 .part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 23728 Mar 26 12:10 .part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 0 Mar 26 12:10 _SUCCESS
-rw-r--r-- 1 sasbpb r 3279617 Mar 26 12:10 part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3244105 Mar 26 12:10 part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3365039 Mar 26 12:10 part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3035960 Mar 26 12:10 part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

Questions:
1. Is this the “standard” for creating/saving a .parquet data set?
2. It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID. Is the format part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention? Is this documented somewhere?
3. Is there a C++ class to create the CRC?

Thanks, Brian
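A minimal sketch of the "one larger file, multiple RowGroups" approach described above, using the low-level writer. The single REQUIRED DOUBLE column, the function shape, and the rows_per_group parameter are illustrative assumptions; in practice rows_per_group would be tuned so each row group lands near the target block size:

#include <algorithm>
#include <cstdint>
#include <parquet/api/writer.h>

// Write total_rows values into one file as a series of row groups
// instead of many small files.
void WriteInRowGroups(parquet::ParquetFileWriter* file_writer,
                      const double* data, int64_t total_rows,
                      int64_t rows_per_group) {
  for (int64_t start = 0; start < total_rows; start += rows_per_group) {
    int64_t n = std::min(rows_per_group, total_rows - start);
    parquet::RowGroupWriter* rg = file_writer->AppendRowGroup();
    auto* writer = static_cast<parquet::DoubleWriter*>(rg->NextColumn());
    // REQUIRED column => no def/rep levels needed.
    writer->WriteBatch(n, nullptr, nullptr, data + start);
  }
  file_writer->Close();  // Finalizes the last row group and the footer.
}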
Parquet File Naming Convention Standards
All, Here is an example .parquet data set saved using pySpark where the following files are members of directory “foo.parquet”:

-rw-r--r-- 1 sasbpb r 8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r-- 1 sasbpb r 25632 Mar 26 12:10 .part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 25356 Mar 26 12:10 .part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 26300 Mar 26 12:10 .part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 23728 Mar 26 12:10 .part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r-- 1 sasbpb r 0 Mar 26 12:10 _SUCCESS
-rw-r--r-- 1 sasbpb r 3279617 Mar 26 12:10 part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3244105 Mar 26 12:10 part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3365039 Mar 26 12:10 part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r-- 1 sasbpb r 3035960 Mar 26 12:10 part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

Questions:
1. Is this the “standard” for creating/saving a .parquet data set?
2. It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID. Is the format part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention? Is this documented somewhere?
3. Is there a C++ class to create the CRC?

Thanks, Brian
Re: Definition Levels and Null
Thanks Wes, We are using the Parquet C++ low-level APIs. Our Parquet "adapter" code will translate the SAS "missing" NaN representation to the correct position in the int16_t def level vector passed to the Parquet low-level writer. Similarly, this adapter will reconstitute the NaN "missing" representation from the def level vector returned from LevelDecoder() at https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L77 up through ReadBatch() and ultimately back to SAS. -Brian On 5/13/19, 2:48 PM, "Wes McKinney" wrote: To comment from the Parquet C++ side, we expose two writer APIs * High level, using Apache Arrow -- use Arrow's bitmap-based null/valid representation for null values, NaN is NaN * Low level, produces your own repetition/definition levels So if you're using the low level API, and you have values like [1, 2, 3, NULL = NaN, 5] then you could represent this as def_levels = [1, 1, 1, 0, 1], rep_levels = nullptr, values = [1, 2, 3, 5] If you don't use the definition level encoding of nulls then other readers will presume the values to be non-null. On Mon, May 13, 2019 at 1:06 PM Tim Armstrong wrote: > > > I see that OPTIONAL or REPEATED must be specified as the Repetition type > for columns where a def level of 0 indicates NULL and 1 means not NULL. The > SchemaDescriptor::BuildTree method at > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661 > shows how this causes max_def_level to increment. > That seems right if your data doesn't have any complex types in it, > max_def_level will always be 0 or 1 depending on whether the column is > REQUIRED/OPTIONAL. One option, depending on your data model, is to always > just mark the field as OPTIONAL and provide the def levels. If they're all > 1 they will compress extremely well. Impala actually does this because > most columns end up being potentially nullable in the Impala/Hive data model. > > > We are using standard Parquet APIs via C++/libparquet.so and therefore > not doing our own Parquet file-format writer/reader. > Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a quick > look and I guess it does expose the concept of rep/def levels. > > > NaNs representing missing values occur frequently in a myriad of SAS use > cases. Other data types may be NULL as well, so I'm wondering if using def > level to indicate NULLs is safer (with consideration to other readers) and > also consumes less memory/storage across the spectrum of Parquet-supported > data types? > If I was in your situation, this is what I'd probably do. We've seen a lot > more inconsistency with handling of NaN between readers. > > On Mon, May 13, 2019 at 10:49 AM Brian Bowman wrote: > > Tim, > > Thanks for your detailed reply and especially for pointing out the RLE > > encoding for the def level! > > Your comment: > > <<- If the field is required, the max def level is 0, therefore all > > values are 0, therefore the def levels can be "decoded" from nothing and > > the def levels can be omitted for the page.>> > > I see that OPTIONAL or REPEATED must be specified as the Repetition type > > for columns where a def level of 0 indicates NULL and 1 means not NULL. The > > SchemaDescriptor::BuildTree method at > > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661 > > shows how this causes max_def_level to increment. > > We are using standard Parquet APIs via C++/libparquet.so and therefore > > not doing our own Parquet file-format writer/reader. > > NaNs representing missing values occur frequently in a myriad of SAS use > > cases. Other data types may be NULL as well, so I'm wondering if using def > > level to indicate NULLs is safer (with consideration to other readers) and > > also consumes less memory/storage across the spectrum of Parquet-supported > > data types? > > Best, > > Brian > > On 5/13/19, 1:03 PM, "Tim Armstrong" > > wrote: > > Parquet float/double values can hold any IEEE floating point value - > > https://github.com/apache/parquet-format/blob/master/src/main
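A hedged sketch of the write-side translation Brian describes above: turn a column of doubles where NaN means "missing" into the def-level vector plus the dense (nulls removed) value vector that WriteBatch expects for an OPTIONAL column with max_definition_level == 1. The function and parameter names are illustrative:

#include <cmath>
#include <cstdint>
#include <vector>

// Map the NaN "missing" sentinel to def level 0 and compact the
// remaining values, matching Wes's [1, 2, 3, NULL = NaN, 5] example:
// def_levels = [1, 1, 1, 0, 1], dense_values = [1, 2, 3, 5].
void NanToDefLevels(const double* in, int64_t num_rows,
                    std::vector<int16_t>* def_levels,
                    std::vector<double>* dense_values) {
  def_levels->resize(num_rows);
  dense_values->clear();
  dense_values->reserve(num_rows);
  for (int64_t i = 0; i < num_rows; ++i) {
    if (std::isnan(in[i])) {
      (*def_levels)[i] = 0;   // 0 => NULL
    } else {
      (*def_levels)[i] = 1;   // 1 => value present
      dense_values->push_back(in[i]);
    }
  }
  // Then, e.g.:
  // double_writer->WriteBatch(num_rows, def_levels->data(), nullptr,
  //                           dense_values->data());
}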
Re: Definition Levels and Null
Tim, Thanks for your detailed reply and especially for pointing out the RLE encoding for the def levels! Your comment: <<- If the field is required, the max def level is 0, therefore all values are 0, therefore the def levels can be "decoded" from nothing and the def levels can be omitted for the page.>> I see that OPTIONAL or REPEATED must be specified as the Repetition type for columns where a def level of 0 indicates NULL and 1 means not NULL. The SchemaDescriptor::BuildTree method at https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661 shows how this causes max_def_level to increment. We are using standard Parquet APIs via C++/libparquet.so and therefore not doing our own Parquet file-format writer/reader. NaNs representing missing values occur frequently in a myriad of SAS use cases. Other data types may be NULL as well, so I'm wondering if using def level to indicate NULLs is safer (with consideration to other readers) and also consumes less memory/storage across the spectrum of Parquet-supported data types? Best, Brian On 5/13/19, 1:03 PM, "Tim Armstrong" wrote: Parquet float/double values can hold any IEEE floating point value - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413. So there's no reason you can't write NaN to the files. If a reader isn't handling NaN values correctly, that seems like an issue with that reader, although I think you're correct in that you're more likely to hit reader bugs with NaN than NULL. (I may be telling you something you already know, but thought I'd start with that). I don't think the Parquet format is opinionated about what NULL vs NaN means, although I'd assume that NULL means that the data simply wasn't present, and NaN means that it was the result of a floating point calculation that resulted in NaN. The rep/definition level encoding is fairly complex because of the handling of nested types and the various ways of encoding the sequence of levels. The way I'd think about it is: - If you don't have any complex/nested types, rep levels aren't needed and the logical def levels degenerate into 1=not null, 0 = null. - The RLE encoding has a bit-width implied by the max def level value - if the max-level is 1, 1 bit is needed per value. If it is 0, 0 bits are needed per value. - If the field is required, the max def level is 0, therefore all values are 0, therefore the def levels can be "decoded" from nothing and the def levels can be omitted for the page. - If the field is nullable, the bit width is 1, therefore each def level is logically a bit. However, RLE encoding is applied to the sequence of 1/0 levels - https://github.com/apache/parquet-format/blob/master/Encodings.md The last point is where I think your understanding might diverge from the implementation - the encoded def levels are not simply a bit vector, it's a more complex hybrid RLE/bit-packed encoding. If you use one of the existing Parquet libraries it will handle all this for you - it's a headache to get it all right from scratch. - Tim On Mon, May 13, 2019 at 8:43 AM Brian Bowman wrote: > All, > > I’m working to integrate the historic usage of SAS missing values for IEEE > doubles into our SAS Viya Parquet integration. SAS writes a NaN to > represent floating-point doubles that are “missing,” i.e. NULL in more > general data management terms. > > Of course SAS’ goal is to create .parquet files that are universally > readable. Therefore, it appears that the SAS Parquet writer(s) will NOT be > able to write the usual NaN to represent “missing,” because doing so will > cause a floating point exception for other readers. > > Based on the Parquet doc at > https://parquet.apache.org/documentation/latest/ and by examining code, I > understand that Parquet NULL values are indicated by setting 0x000 at the > definition level vector offset corresponding to each NULL column offset > value. > > Conversely, it appears that the per-column, per-page definition level data > is never written when required is not specified for the column schema. > > Is my understanding and Parquet terminology correct here? > > Thanks, > > Brian
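A companion read-side sketch for the adapter described earlier in this thread: expand the dense values returned by ReadBatch back into a row-aligned array, re-inserting NaN wherever the def level is 0. The names are illustrative; levels_read is the count a ReadBatch call returns:

#include <cstdint>
#include <limits>

// Reconstitute the NaN "missing" representation from def levels.
void DefLevelsToNan(const int16_t* def_levels, int64_t levels_read,
                    const double* dense_values, double* out) {
  int64_t v = 0;  // index into the dense (non-null) values
  for (int64_t i = 0; i < levels_read; ++i) {
    out[i] = (def_levels[i] == 1)
                 ? dense_values[v++]
                 : std::numeric_limits<double>::quiet_NaN();
  }
}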
Definition Levels and Null (resend)
All, I’m working to integrate the historic usage of SAS missing values for IEEE doubles into our SAS Viya Parquet integration. SAS writes a NaN to represent floating-point doubles that are “missing,” i.e. NULL in more general data management terms. Of course SAS’ goal is to create .parquet files that are universally readable. Therefore, it appears that the SAS Parquet writer(s) will NOT be able to write the usual NaN to represent “missing,” because doing so will cause a floating point exception for other readers. Based on the Parquet doc at https://parquet.apache.org/documentation/latest/ and by examining code, I understand that Parquet NULL values are indicated by setting 0x000 at the definition level vector offset corresponding to each NULL column offset value. Conversely, it appears that the per-column, per-page definition level data is never written when required is not specified for the column schema. Is my understanding and Parquet terminology correct here? Thanks, Brian
Definition Levels and Null
All, I’m working to integrate the historic usage of SAS missing values for IEEE doubles into our SAS Viya Parquet integration. SAS writes a NaN to represent floating-point doubles that are “missing,” i.e. NULL in more general data management terms. Of course SAS’ goal is to create .parquet files that are universally readable. Therefore, it appears that the SAS Parquet writer(s) will NOT be able to write the usual NaN to represent “missing,” because doing so will cause a floating point exception for other readers. Based on the Parquet doc at https://parquet.apache.org/documentation/latest/ and by examining code, I understand that Parquet NULL values are indicated by setting 0x000 at the definition level vector offset corresponding to each NULL column offset value. Conversely, it appears that the per-column, per-page definition level data is never written when required is not specified for the column schema. Is my understanding and Parquet terminology correct here? Thanks, Brian
Parquet vs. other Open Source Columnar Formats
All, Is it fair to say that Parquet is fast becoming the dominant open source columnar storage format? How do those of you with long-term Hadoop experience see this? For example, is Parquet overtaking ORC and Avro? Thanks, Brian
Re: Need 64-bit Integer length for Parquet ByteArray Type
Hello Wes, Thanks for the info! I'm working to better understand Parquet/Arrow design and development processes. No hurry for LARGE_BYTE_ARRAY. -Brian On 4/26/19, 11:14 AM, "Wes McKinney" wrote: hi Brian, I doubt that such a change could be made on a short time horizon. Collecting feedback and building consensus (if it is even possible) with stakeholders would take some time. The appropriate place to have the discussion is here on the mailing list, though Thanks On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman wrote: > > Hello Wes/all, > > A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives. Is this something that could be done in Parquet over the next few months? I have a lot of experience with file formats/storage layer internals and can contribute to Parquet C++. > > -Brian > > On 4/5/19, 3:44 PM, "Wes McKinney" wrote: > > hi Brian, > > Just to comment from the C++ side -- the 64-bit issue is a limitation > of the Parquet format itself and not related to the C++ > implementation. It would be possibly interesting to add a > LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing > doing much the same in Apache Arrow for in-memory) > > - Wes > > On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue wrote: > > > > I don't think that's what you would want to do. Parquet will eventually > > compress large values, but not after making defensive copies and attempting > > to encode them. In the end, it will be a lot more overhead, plus the work > > to make it possible. I think you'd be much better off compressing before > > storing in Parquet if you expect good compression rates. > > > > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman wrote: > > > > > My hope is that these large ByteArray values will encode/compress to a > > > fraction of their original size. FWIW, cpp/src/parquet/ > > > column_writer.cc/.h has int64_t offset and length fields all over the > > > place. > > > > > > External file references to BLOBS is doable but not the elegant, > > > integrated solution I was hoping for. > > > > > > -Brian > > > > > > On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote: > > > > > > Looks like we will need a new encoding for this: > > > https://github.com/apache/parquet-format/blob/master/Encodings.md > > > > > > That doc specifies that the plain encoding uses a 4-byte length. That's > > > not going to be a quick fix. > > > > > > Now that I'm thinking about this a bit more, does it make sense to support > > > byte arrays that are more than 2GB? That's far larger than the size of a > > > row group, let alone a page. This would completely break memory management > > > in the JVM implementation. > > > > > > Can you solve this problem using a BLOB type that references an external > > > file with the gigantic values? Seems to me that values this large should go > > > in separate files, not in a Parquet file where it would destroy any benefit > > > from using the format. > > > > > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote: > > > > > >> Hello Ryan, > > >> > > >> Looks like it's limited by both the Parquet implementation and the Thrift > > >> message methods. Am I missing anything? > > >> > > >> From cpp/src/parquet/types.h > > >> > > >> struct ByteArray { > > >> ByteArray() : len(0), ptr(NULLPTR) {} > > >> ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} > > >> uint32_t len; > > >> const uint8_t* ptr; > > >> }; > > >> > > >> From cpp/src/parquet/thrift.h > > >> > > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* > > >> deserialized_msg) { inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* > > >> out)
Re: Parquet Sync
Does the sync happen on Google Hangout? Could someone please provide a link on where to sign up/connect? Thanks, Brian > On Apr 18, 2019, at 12:51 PM, Xinli shang wrote: > > Hi all, > > Please send your agenda for the next Parquet community sync up meeting. I > will compile and send the list before the meeting. One of the agenda items I have > so far is encryption. The meeting will be tentatively at April 30 Tuesday > 9-10am PT, just like our previous regular meeting time. Please let me know > if you have any questions about the agenda or date/time. > > Xinli > > On Mon, Apr 15, 2019 at 10:54 PM Julien Le Dem > wrote: > >> It would be fine to have a rotation. >> >> On Mon, Apr 15, 2019 at 10:44 PM Lars Volker >> wrote: >> >>> Hi, >>> >>> I'd be happy to help. I have organized a few of these in the past, and >> I've >>> recently started similar meetings for the Impala project. >>> >>> If someone else wants to do it, that's fine for me, too, of course. >>> >>> Cheers, Lars >>> >>> On Mon, Apr 15, 2019, 22:14 Julien Le Dem >> wrote: Hello all, Since I have been away with the new baby the Parquet syncs have fallen behind. I'd like a volunteer to run those. Responsibilities include taking notes and posting them on the list. Also occasionally finding a good time for the meeting. Any takers? This could be a rotating duty as well. Thank you Julien >>> >> > > > -- > Xinli Shang
Re: Parquet Sync
All, I look forward to participating in the upcoming Parquet Syncs. I'll be happy to be a "scribe in rotation" but would first like to participate in a couple of Syncs. By way of introduction: I'm Brian Bowman, a 34+ year veteran of SAS R&D. I've been working with Parquet Open Source and C++ for the past four months but have no prior open source experience. My career has been programming in Assembly, C, Java and SAS, with decades of work in file format design, storage layer internals, and scalable distributed access control capabilities. For the past 5 years I've been doing core R&D for Cloud Analytic Services (CAS) -- the modern SAS distributed analytics and data management framework. I work on the CAS distributed table, I/O, and indexing capabilities ... and now Parquet integration with CAS. Arrow/Parquet are exciting technologies and I look forward to more work with this group as our efforts move ahead. Best, Brian Brian Bowman Principal Software Developer Analytic Server R&D SAS Institute Inc. On 4/16/19, 1:54 AM, "Julien Le Dem" wrote: It would be fine to have a rotation. On Mon, Apr 15, 2019 at 10:44 PM Lars Volker wrote: > Hi, > > I'd be happy to help. I have organized a few of these in the past, and I've > recently started similar meetings for the Impala project. > > If someone else wants to do it, that's fine for me, too, of course. > > Cheers, Lars > > On Mon, Apr 15, 2019, 22:14 Julien Le Dem wrote: > > > Hello all, > > Since I have been away with the new baby the Parquet syncs have fallen > > behind. > > I'd like a volunteer to run those. > > Responsibilities include taking notes and posting them on the list. > > Also occasionally finding a good time for the meeting. > > Any takers? This could be a rotating duty as well. > > Thank you > > Julien > > >
Re: Current Parquet Version
Fokko, Thank you! I'm not very experienced with GitHub yet and had looked in the wrong place. Best, Brian On 4/9/19, 10:38 PM, "Driesprong, Fokko" wrote: Hi Brian, You could take a look at the Github of the Apache Parquet Format itself: https://github.com/apache/parquet-format Cheers, Fokko On Mon, Apr 8, 2019 at 20:19 Brian Bowman wrote: > What is the most current Apache Parquet file format version? Where is this > designated on the official Apache (or GitHub) site? > > Thanks, > > Brian
Re: Need 64-bit Integer length for Parquet ByteArray Type
Hello Wes/all, A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives. Is this something that could be done in Parquet over the next few months? I have a lot of experience with file formats/storage layer internals and can contribute to Parquet C++. -Brian On 4/5/19, 3:44 PM, "Wes McKinney" wrote: EXTERNAL hi Brian, Just to comment from the C++ side -- the 64-bit issue is a limitation of the Parquet format itself and not related to the C++ implementation. It would possibly be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing doing much the same in Apache Arrow for in-memory) - Wes On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue wrote: > I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before storing in Parquet if you expect good compression rates. > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman wrote: > > My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place. > > External file references to BLOBS are doable but not the elegant, integrated solution I was hoping for. > > -Brian > > On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote: > > EXTERNAL Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md > > That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix. > > Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation. > > Can you solve this problem using a BLOB type that references an external file with the gigantic values? Seems to me that values this large should go in separate files, not in a Parquet file where it would destroy any benefit from using the format. > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote: > >> Hello Ryan, > >> Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? > >> From cpp/src/parquet/types.h > >> struct ByteArray { ByteArray() : len(0), ptr(NULLPTR) {} ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} uint32_t len; const uint8_t* ptr; }; > >> From cpp/src/parquet/thrift.h > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) { > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) > >> -Brian > >> On 4/5/19, 1:32 PM, "Ryan Blue" wrote: > >> EXTERNAL Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations? > >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote: > >> > All, > >> > SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. > >> > Have there been any requests for this or other Parquet developments that require file format versioning changes? > >> > I realize this is a non-trivial ask. Thanks for considering it. > >> > -Brian > >> -- Ryan Blue Software Engineer Netflix > > -- Ryan Blue Software Engineer Netflix > -- Ryan Blue Software Engineer Netflix
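To make Wes's LARGE_BYTE_ARRAY suggestion concrete, a 64-bit variant of the existing struct might look like the sketch below. This LargeByteArray type is hypothetical (it does not exist in the Parquet C++ headers) and is shown only to illustrate the proposed widening of the length field:

#include <cstdint>

// Hypothetical 64-bit counterpart to parquet::ByteArray; only the len
// field changes, from uint32_t to uint64_t, lifting the 4 GB ceiling.
struct LargeByteArray {
  LargeByteArray() : len(0), ptr(nullptr) {}
  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint64_t len;
  const uint8_t* ptr;
};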
Current Parquet Version
What is the most current Apache Parquet file format version? Where is this designated on the official Apache (or GitHub) site? Thanks, Brian
Re: Need 64-bit Integer length for Parquet ByteArray Type
Thanks Ryan, After further pondering this, I came to similar conclusions. Compress the data before putting it into a Parquet ByteArray, and if that's not feasible, reference it in an external/persisted data structure. Another alternative is to create one or more "shadow columns" to store the overflow horizontally. -Brian On Apr 5, 2019, at 3:11 PM, Ryan Blue wrote: EXTERNAL I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before storing in Parquet if you expect good compression rates. On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman wrote: My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place. External file references to BLOBS are doable but not the elegant, integrated solution I was hoping for. -Brian On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote: EXTERNAL Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix. Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation. Can you solve this problem using a BLOB type that references an external file with the gigantic values? Seems to me that values this large should go in separate files, not in a Parquet file where it would destroy any benefit from using the format. On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote: Hello Ryan, Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? From cpp/src/parquet/types.h struct ByteArray { ByteArray() : len(0), ptr(NULLPTR) {} ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} uint32_t len; const uint8_t* ptr; }; From cpp/src/parquet/thrift.h inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) { inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) -Brian On 4/5/19, 1:32 PM, "Ryan Blue" wrote: EXTERNAL Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations? On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote: > All, > SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. > Have there been any requests for this or other Parquet developments that require file format versioning changes? > I realize this is a non-trivial ask. Thanks for considering it. > -Brian -- Ryan Blue Software Engineer Netflix -- Ryan Blue Software Engineer Netflix -- Ryan Blue Software Engineer Netflix
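A minimal sketch of the "shadow column" idea above, assuming each segment is written as an ordinary ByteArray in its own column; the helper name and the segmenting scheme are illustrative, not an existing API:

#include <algorithm>
#include <cstdint>
#include <vector>
#include "parquet/types.h"

// Split one oversized value into segments that each fit a uint32_t
// ByteArray length; segment k would be written to the k-th shadow column,
// and readers would reassemble the value by concatenating segments in
// column order. The segments point into the caller's buffer, not a copy.
std::vector<parquet::ByteArray> SplitIntoShadowColumns(
    const uint8_t* value, uint64_t total_len, uint32_t segment_len) {
  std::vector<parquet::ByteArray> segments;
  for (uint64_t off = 0; off < total_len; off += segment_len) {
    uint64_t n = std::min<uint64_t>(segment_len, total_len - off);
    segments.emplace_back(static_cast<uint32_t>(n), value + off);
  }
  return segments;
}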
Re: Need 64-bit Integer length for Parquet ByteArray Type
My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place. External file references to BLOBS are doable but not the elegant, integrated solution I was hoping for. -Brian On Apr 5, 2019, at 1:53 PM, Ryan Blue wrote: EXTERNAL Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix. Now that I'm thinking about this a bit more, does it make sense to support byte arrays that are more than 2GB? That's far larger than the size of a row group, let alone a page. This would completely break memory management in the JVM implementation. Can you solve this problem using a BLOB type that references an external file with the gigantic values? Seems to me that values this large should go in separate files, not in a Parquet file where it would destroy any benefit from using the format. On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman wrote: Hello Ryan, Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? From cpp/src/parquet/types.h struct ByteArray { ByteArray() : len(0), ptr(NULLPTR) {} ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} uint32_t len; const uint8_t* ptr; }; From cpp/src/parquet/thrift.h inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) { inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) -Brian On 4/5/19, 1:32 PM, "Ryan Blue" wrote: EXTERNAL Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations? On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote: > All, > SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. > Have there been any requests for this or other Parquet developments that require file format versioning changes? > I realize this is a non-trivial ask. Thanks for considering it. > -Brian -- Ryan Blue Software Engineer Netflix -- Ryan Blue Software Engineer Netflix
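For illustration, a minimal sketch of Ryan's compress-before-storing advice using zlib; the helper name is made up, error handling is elided, and it assumes the compressed result fits in a uint32_t length:

#include <cstdint>
#include <vector>
#include <zlib.h>
#include "parquet/types.h"

// Compress a large value up front, then hand the (hopefully much smaller)
// result to Parquet as a regular ByteArray. The scratch vector must outlive
// the returned ByteArray, which only points at the compressed bytes.
parquet::ByteArray CompressToByteArray(const uint8_t* src, uint64_t src_len,
                                       std::vector<uint8_t>* scratch) {
  uLongf dest_len = compressBound(static_cast<uLong>(src_len));
  scratch->resize(dest_len);
  // compress() returns Z_OK on success; a real implementation would check
  // the return code and fall back to storing the value uncompressed.
  compress(scratch->data(), &dest_len, src, static_cast<uLong>(src_len));
  return parquet::ByteArray(static_cast<uint32_t>(dest_len), scratch->data());
}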
Re: Need 64-bit Integer length for Parquet ByteArray Type
Hello Ryan, Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? From cpp/src/parquet/types.h struct ByteArray { ByteArray() : len(0), ptr(NULLPTR) {} ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} uint32_t len; const uint8_t* ptr; }; From cpp/src/parquet/thrift.h inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) { inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) -Brian On 4/5/19, 1:32 PM, "Ryan Blue" wrote: EXTERNAL Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations? On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote: > All, > SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. > Have there been any requests for this or other Parquet developments that require file format versioning changes? > I realize this is a non-trivial ask. Thanks for considering it. > -Brian -- Ryan Blue Software Engineer Netflix
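On Ryan's question about what imposes the limit: besides the struct above, the PLAIN encoding itself writes each BYTE_ARRAY value with a 4-byte length prefix, roughly as in this sketch (function name illustrative, little-endian host assumed):

#include <cstdint>
#include <cstring>
#include <vector>

// PLAIN-encode one BYTE_ARRAY value: a 4-byte little-endian length followed
// by the raw bytes, per parquet-format's Encodings.md. The uint32_t prefix
// is where the 2^32 ceiling shows up on disk, independent of the C++ struct.
void PlainEncodeByteArray(const uint8_t* value, uint32_t len,
                          std::vector<uint8_t>* out) {
  uint8_t prefix[4];
  std::memcpy(prefix, &len, sizeof(len));  // assumes a little-endian host
  out->insert(out->end(), prefix, prefix + sizeof(prefix));
  out->insert(out->end(), value, value + len);
}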
Need 64-bit Integer length for Parquet ByteArray Type
All, SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. Looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. Have there been any requests for this or other Parquet developments that require file format versioning changes? I realize this is a non-trivial ask. Thanks for considering it. -Brian
Re: Passing File Descriptors in the Low-Level API
Thanks Wes! I'm working on integrating and testing the necessary changes in our dev environment. I'll submit a PR once things are working. Best, Brian On 3/16/19, 4:24 PM, "Wes McKinney" wrote: EXTERNAL hi Brian, Please feel free to submit a PR to add the requisite APIs that you need for your application. Antoine or I or others should be able to give prompt feedback since we know this code pretty well. Thanks Wes On Sat, Mar 16, 2019 at 11:40 AM Brian Bowman wrote: > Hi Wes, > Thanks for the quick reply! To be clear, the usage I'm working on needs to own both the open FileDescriptor and the corresponding mapped memory. In other words ... > The SAS component does both open() and mmap(), which could be for READ or WRITE. > -> Calls low-level Parquet APIs to read an existing file or write a new one. The open() and mmap() flags are guaranteed to be correct. > At some later point the SAS component does a munmap() and close(). > -Brian > On 3/14/19, 3:42 PM, "Wes McKinney" wrote: > hi Brian, > This is mostly an Arrow platform question so I'm copying the Arrow mailing list. > You can open a file using an existing file descriptor using ReadableFile::Open > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145 > The documentation for this function says: > "The file descriptor becomes owned by the ReadableFile, and will be closed on Close() or destruction." > If you want to do the equivalent thing, but using memory mapping, I think you'll need to add a corresponding API to MemoryMappedFile. This is more perilous because of the API requirements of mmap -- you need to pass the right flags and they may need to be the same flags that were passed when opening the file descriptor, see > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378 > and > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476 > - Wes > On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman wrote: > > The ReadableFile class (arrow/io/file.cc) has utility methods where a FileDescriptor is either passed in or returned, but I don't see how this surfaces through the API. > > Is there a way for application code to control the open lifetime of mmap()'d Parquet files by passing an already open FileDescriptor to Parquet low-level API open/close methods? > > Thanks, > > Brian
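A minimal sketch of the fd-owning path Wes points to; the exact ReadableFile::Open signature has shifted across Arrow releases, so this assumes the Status-and-out-parameter style from the era of this thread:

#include <memory>
#include "arrow/io/file.h"
#include "arrow/status.h"

// Wrap a descriptor the application already opened. Per the documentation
// quoted above, the ReadableFile takes ownership and closes the fd on
// Close() or destruction, so the caller must not close it a second time.
arrow::Status WrapExistingFd(int fd,
                             std::shared_ptr<arrow::io::ReadableFile>* out) {
  return arrow::io::ReadableFile::Open(fd, out);
}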
Re: Passing File Descriptors in the Low-Level API
Hi Wes, Thanks for the quick reply! To be clear, the usage I'm working on needs to own both the open FileDescriptor and the corresponding mapped memory. In other words ... The SAS component does both open() and mmap(), which could be for READ or WRITE. -> Calls low-level Parquet APIs to read an existing file or write a new one. The open() and mmap() flags are guaranteed to be correct. At some later point the SAS component does a munmap() and close(). -Brian On 3/14/19, 3:42 PM, "Wes McKinney" wrote: hi Brian, This is mostly an Arrow platform question so I'm copying the Arrow mailing list. You can open a file using an existing file descriptor using ReadableFile::Open https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145 The documentation for this function says: "The file descriptor becomes owned by the ReadableFile, and will be closed on Close() or destruction." If you want to do the equivalent thing, but using memory mapping, I think you'll need to add a corresponding API to MemoryMappedFile. This is more perilous because of the API requirements of mmap -- you need to pass the right flags and they may need to be the same flags that were passed when opening the file descriptor, see https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378 and https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476 - Wes On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman wrote: > The ReadableFile class (arrow/io/file.cc) has utility methods where a FileDescriptor is either passed in or returned, but I don't see how this surfaces through the API. > Is there a way for application code to control the open lifetime of mmap()'d Parquet files by passing an already open FileDescriptor to Parquet low-level API open/close methods? > Thanks, > Brian
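For the flag pairing Wes cautions about, here is a read-only example of the open()/mmap() sequence described above; a writable mapping would instead need O_RDWR together with PROT_READ|PROT_WRITE and MAP_SHARED. The function name is illustrative:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Open and map a file read-only. O_RDONLY and PROT_READ must agree, which
// is the consistency requirement if the fd is later handed to Arrow/Parquet.
// The caller keeps ownership and does munmap()/close() when finished.
void* MapFileReadOnly(const char* path, int* fd_out, size_t* len_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  void* addr = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ,
                    MAP_PRIVATE, fd, 0);
  *fd_out = fd;
  *len_out = static_cast<size_t>(st.st_size);
  return (addr == MAP_FAILED) ? nullptr : addr;
}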
Passing File Descriptors in the Low-Level API
The ReadableFile class (arrow/io/file.cc) has utility methods where a FileDescriptor is either passed in or returned, but I don’t see how this surfaces through the API. Is there a way for application code to control the open lifetime of mmap()’d Parquet files by passing an already open FileDescriptor to Parquet low-level API open/close methods? Thanks, Brian