[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file
[ https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584436#comment-16584436 ]

ASF GitHub Bot commented on PARQUET-1369:
-----------------------------------------

rgruener opened a new pull request #491: PARQUET-1369: Disregard column sort order if statistics max/min are equal
URL: https://github.com/apache/parquet-cpp/pull/491

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Unavailable Parquet column statistics from Spark-generated file
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-1369
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1369
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.4.0
>            Reporter: Robert Gruener
>            Assignee: Robert Gruener
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: cpp-1.5.0
>
> I have a dataset generated by Spark which shows it has statistics for the
> string column when using the Java parquet-mr code (shown by using
> `parquet-tools meta`), however reading from pyarrow shows that the statistics
> for that column are not set. I should note the column only has a single
> value, though it still seems like a problem that pyarrow can't recognize it
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for a file
> example.
> Pyarrow code to check statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No statistics for the string column; prints False and the statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema
> int:    REQUIRED INT64 R:0 D:0
> string: OPTIONAL BINARY O:UTF8 R:0 D:1
> float:  REQUIRED DOUBLE R:0 D:0
>
> row group 1: RC:8333 TS:76031 OFFSET:4
> int:    INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333
>         ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string: BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333
>         ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 4192]
> float:  DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333
>         ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like
> pyarrow should be able to read the statistics set. I made this here and not a
> JIRA since I wanted to be sure this is actually an issue and there wasn't a
> ticket already made there (I couldn't find one, but I wanted to be sure).
> Either way, I would like to understand why this is.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
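The idea behind the linked fix (disregard the column sort order when max and min are equal) can be sketched in Python. This is an illustrative stand-in, not the actual parquet-cpp code; the function name and arguments are hypothetical:

```python
# Sketch of the PR #491 idea: min/max statistics for columns whose sort
# order is ambiguous (e.g. BINARY from older writers) are normally dropped,
# but when min == max the ordering cannot change their meaning.
def stats_are_usable(min_value, max_value, sort_order_known):
    """Hypothetical helper: decide whether min/max statistics can be trusted."""
    if sort_order_known:
        return True
    # An ambiguous sort order is harmless when the column holds a single value.
    return min_value == max_value

# The Spark-written 'string' column above has min == max == 'hello':
print(stats_are_usable(b"hello", b"hello", sort_order_known=False))  # True
print(stats_are_usable(b"a", b"z", sort_order_known=False))          # False
```

With this rule, pyarrow could surface the statistics for the single-valued string column in the report while still rejecting ambiguous multi-valued ones.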
[jira] [Updated] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file
[ https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-1369:
------------------------------------
    Labels: parquet pull-request-available (was: parquet)
[jira] [Assigned] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool
[ https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned PARQUET-1256:
-------------------------------------
    Assignee: Jacek Pliszka

> [C++] Add --print-key-value-metadata option to parquet_reader tool
> ------------------------------------------------------------------
>
>                 Key: PARQUET-1256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Jacek Pliszka
>            Assignee: Jacek Pliszka
>            Priority: Trivial
>              Labels: patch, pull-request-available
>             Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool
[ https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1256.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 450
[https://github.com/apache/parquet-cpp/pull/450]
[jira] [Commented] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool
[ https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584429#comment-16584429 ]

ASF GitHub Bot commented on PARQUET-1256:
-----------------------------------------

wesm closed pull request #450: PARQUET-1256: Add --print-key-value-metadata option to parquet_reader tool
URL: https://github.com/apache/parquet-cpp/pull/450

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/parquet/printer.cc b/src/parquet/printer.cc
index 3f18a5c8..9f26a418 100644
--- a/src/parquet/printer.cc
+++ b/src/parquet/printer.cc
@@ -33,13 +33,25 @@ namespace parquet {
 #define COL_WIDTH "30"
 void ParquetFilePrinter::DebugPrint(std::ostream& stream, std::list selected_columns,
-    bool print_values, const char* filename) {
+    bool print_values, bool print_key_value_metadata,
+    const char* filename) {
   const FileMetaData* file_metadata = fileReader->metadata().get();
   stream << "File Name: " << filename << "\n";
   stream << "Version: " << file_metadata->version() << "\n";
   stream << "Created By: " << file_metadata->created_by() << "\n";
   stream << "Total rows: " << file_metadata->num_rows() << "\n";
+
+  if (print_key_value_metadata) {
+    auto key_value_metadata = file_metadata->key_value_metadata();
+    int64_t size_of_key_value_metadata = key_value_metadata->size();
+    stream << "Key Value File Metadata: " << size_of_key_value_metadata << " entries\n";
+    for (int64_t i = 0; i < size_of_key_value_metadata; i++) {
+      stream << " Key nr " << i << " " << key_value_metadata->key(i) << ": "
+             << key_value_metadata->value(i) << "\n";
+    }
+  }
+
   stream << "Number of RowGroups: " << file_metadata->num_row_groups() << "\n";
   stream << "Number of Real Columns: "
          << file_metadata->schema()->group_node()->field_count() << "\n";
diff --git a/src/parquet/printer.h b/src/parquet/printer.h
index 3b828829..1113c3fe 100644
--- a/src/parquet/printer.h
+++ b/src/parquet/printer.h
@@ -38,7 +38,8 @@ class PARQUET_EXPORT ParquetFilePrinter {
   ~ParquetFilePrinter() {}
   void DebugPrint(std::ostream& stream, std::list selected_columns,
-      bool print_values = true, const char* fileame = "No Name");
+      bool print_values = true, bool print_key_value_metadata = false,
+      const char* filename = "No Name");
   void JSONPrint(std::ostream& stream, std::list selected_columns,
       const char* filename = "No Name");
diff --git a/tools/parquet_reader.cc b/tools/parquet_reader.cc
index 7ef59dc1..34bdfc10 100644
--- a/tools/parquet_reader.cc
+++ b/tools/parquet_reader.cc
@@ -24,13 +24,14 @@ int main(int argc, char** argv) {
   if (argc > 5 || argc < 2) {
     std::cerr << "Usage: parquet_reader [--only-metadata] [--no-memory-map] [--json]"
-                 "[--columns=...] "
+                 "[--print-key-value-metadata] [--columns=...] "
              << std::endl;
     return -1;
   }
   std::string filename;
   bool print_values = true;
+  bool print_key_value_metadata = false;
   bool memory_map = true;
   bool format_json = false;
@@ -42,6 +43,8 @@ int main(int argc, char** argv) {
   for (int i = 1; i < argc; i++) {
     if ((param = std::strstr(argv[i], "--only-metadata"))) {
       print_values = false;
+    } else if ((param = std::strstr(argv[i], "--print-key-value-metadata"))) {
+      print_key_value_metadata = true;
     } else if ((param = std::strstr(argv[i], "--no-memory-map"))) {
       memory_map = false;
     } else if ((param = std::strstr(argv[i], "--json"))) {
@@ -64,7 +67,8 @@ int main(int argc, char** argv) {
   if (format_json) {
     printer.JSONPrint(std::cout, columns, filename.c_str());
   } else {
-    printer.DebugPrint(std::cout, columns, print_values, filename.c_str());
+    printer.DebugPrint(std::cout, columns, print_values,
+                       print_key_value_metadata, filename.c_str());
   }
 } catch (const std::exception& e) {
   std::cerr << "Parquet error: " << e.what() << std::endl;

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
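The new printer path in the patch above can be mimicked in Python to show the intended output shape. This is an illustrative stand-in for the C++ code, with a plain list of pairs playing the role of the file's key-value metadata:

```python
import io

def print_key_value_metadata(kv_pairs, stream):
    # kv_pairs stands in for the file's KeyValueMetadata: a list of
    # (key, value) tuples, printed in the same layout as the C++ patch.
    stream.write("Key Value File Metadata: %d entries\n" % len(kv_pairs))
    for i, (key, value) in enumerate(kv_pairs):
        stream.write(" Key nr %d %s: %s\n" % (i, key, value))

# Example with made-up metadata entries:
out = io.StringIO()
print_key_value_metadata([("writer", "parquet-mr"), ("schema", "spark_schema")], out)
print(out.getvalue(), end="")
```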
[jira] [Updated] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool
[ https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-1256:
------------------------------------
    Labels: patch pull-request-available (was: patch)
[jira] [Comment Edited] (PARQUET-1370) [C++] Read consecutive column chunks in a single scan
[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584397#comment-16584397 ]

Robert Gruener edited comment on PARQUET-1370 at 8/17/18 9:20 PM:
------------------------------------------------------------------

That seems to only be the case for python3. Do the pyarrow file handles not implement RawIOBase in python2 as well? As far as I can tell the code does not suggest that, though those have been my results.

was (Author: rgruener):
That seems to only be the case for python3. Do the pyarrow file handles no implement RawIOBase in python2 as well? As far as I can tell the code does not suggest that though those have been my results.

> [C++] Read consecutive column chunks in a single scan
> -----------------------------------------------------
>
>                 Key: PARQUET-1370
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1370
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Robert Gruener
>            Priority: Major
>
> Currently parquet-cpp calls for a filesystem scan with every single data page, see
> https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181
> For remote filesystems this can be very inefficient when reading many small
> columns. The Java implementation already does this and will read consecutive
> column chunks (and the resulting pages) in a single scan, see
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786
>
> This might be a bit difficult to do, as it would require changing a lot of
> the code structure, but it would certainly be valuable for workloads concerned
> with optimal read performance.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (PARQUET-1370) [C++] Read consecutive column chunks in a single scan
[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584397#comment-16584397 ]

Robert Gruener commented on PARQUET-1370:
-----------------------------------------

That seems to only be the case for python3. Do the pyarrow file handles no implement RawIOBase in python2 as well? As far as I can tell the code does not suggest that though those have been my results.
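A common way to get the single-scan behavior this issue asks for is to coalesce the byte ranges of consecutive column chunks before issuing reads. A minimal sketch (a hypothetical helper, not parquet-cpp or parquet-mr code):

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) byte ranges that touch or overlap (within
    max_gap bytes) so consecutive column chunks are fetched in one scan."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            # Extend the previous range instead of issuing a new read.
            last_off, last_len = merged[-1]
            new_end = max(last_off + last_len, offset + length)
            merged[-1] = (last_off, new_end - last_off)
        else:
            merged.append((offset, length))
    return merged

# Three adjacent column chunks collapse into a single read:
print(coalesce_ranges([(0, 100), (100, 50), (150, 25)]))  # [(0, 175)]
```

On a remote filesystem, each merged range becomes one request, which is the efficiency win the Java ParquetFileReader already exploits.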
[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc
[ https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584161#comment-16584161 ]

ASF GitHub Bot commented on PARQUET-1384:
-----------------------------------------

wesm closed pull request #490: PARQUET-1384: fix clang build error for bloom_filter-test.cc
URL: https://github.com/apache/parquet-cpp/pull/490

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc
index 69583af5..96d2e065 100644
--- a/src/parquet/bloom_filter-test.cc
+++ b/src/parquet/bloom_filter-test.cc
@@ -165,7 +165,7 @@ TEST(CompatibilityTest, TestBloomFilter) {
   std::unique_ptr bitset(new uint8_t[size]());
   std::shared_ptr buffer(new Buffer(bitset.get(), size));
-  handle->Read(size, );
+  PARQUET_THROW_NOT_OK(handle->Read(size, ));
   InMemoryInputStream source(buffer);
   BlockSplitBloomFilter bloom_filter1 = BlockSplitBloomFilter::Deserialize();
@@ -192,10 +192,10 @@ TEST(CompatibilityTest, TestBloomFilter) {
   bloom_filter2.WriteTo();
   std::shared_ptr buffer1 = sink.GetBuffer();
-  handle->Seek(0);
-  handle->GetSize();
+  PARQUET_THROW_NOT_OK(handle->Seek(0));
+  PARQUET_THROW_NOT_OK(handle->GetSize());
   std::shared_ptr buffer2;
-  handle->Read(size, );
+  PARQUET_THROW_NOT_OK(handle->Read(size, ));
   EXPECT_TRUE((*buffer1).Equals(*buffer2));
 }

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [C++] Clang compiler warnings in bloom_filter-test.cc
> -----------------------------------------------------
>
>                 Key: PARQUET-1384
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1384
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Junjie Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: cpp-1.5.0
>
> {code}
> [69/95] Building CXX object src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = std::move(BlockSplitBloomFilter::Deserialize());
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
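The PARQUET_THROW_NOT_OK macro used in the fix converts an otherwise-ignorable status return into an exception, which is what silences the `-Wunused-result` warnings. The pattern can be illustrated in Python; the Status tuple and function names here are hypothetical stand-ins, not the parquet-cpp API:

```python
class ParquetError(Exception):
    pass

def throw_not_ok(status):
    """Stand-in for PARQUET_THROW_NOT_OK: fail loudly instead of letting
    an error status be silently dropped (the source of the clang warning)."""
    ok, message = status  # status modeled as an (ok, message) tuple
    if not ok:
        raise ParquetError(message)

throw_not_ok((True, ""))  # success: no effect
try:
    throw_not_ok((False, "Read failed"))
except ParquetError as exc:
    print("raised:", exc)  # raised: Read failed
```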
Re: Parquet sync meeting minutes
Hi,

Sorry, that was an error on my side; I suggested Nandor add a TLDR section with this title. I agree with your comment, Wes: "outcome" would have been a better choice of word than "decision".

Br,
Zoltan

On Fri, Aug 17, 2018 at 6:36 PM Wes McKinney wrote:
> hi Nandor,
>
> A fine detail, and I may be wrong, but I don't think decisions can
> technically be made on a call because time zones do not permit
> everyone to join always and not all collaborators are comfortable
> having live discussions in English. see [1]
>
> You can present the consensus of the participants in the call summary
> and others in the community have an opportunity to provide feedback.
> The "decision" is therefore one based on lazy consensus thereafter if
> there are no objections or follow up discussion
>
> - Wes
>
> [1]: https://www.apache.org/foundation/how-it-works.html#management
Re: Parquet sync meeting minutes
hi Nandor,

A fine detail, and I may be wrong, but I don't think decisions can technically be made on a call, because time zones do not permit everyone to join always and not all collaborators are comfortable having live discussions in English. See [1].

You can present the consensus of the participants in the call summary, and others in the community have an opportunity to provide feedback. The "decision" is therefore one based on lazy consensus thereafter, if there are no objections or follow-up discussion.

- Wes

[1]: https://www.apache.org/foundation/how-it-works.html#management

On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar wrote:
> Topics discussed and decisions (meeting held on 2018 August 15th, at 6pm CET / 9am PST):
>
> - Aligning page row boundaries between different columns: Debated, please follow up
> - Remove Java-specific code from parquet-format: Accepted
> - Column encryption: Please review
> - Parquet-format release: Scope accepted
> - C++ mono-repo: Please vote
>
> Aligning page row boundaries between different columns (Gabor)
> --------------------------------------------------------------
>
> Background: In the existing specification of column indexes, page boundaries
> are not aligned between different columns with respect to row count.
>
> Gabor: implemented this logic; interested parties can review the code here:
> - https://github.com/apache/parquet-mr/pull/509
> - https://github.com/apache/parquet-mr/commits/column-indexes
>
> Main takeaways from the implementation:
> - Index filtering logic as currently specified is overcomplicated.
> - May become a maintenance burden and results in a steep learning curve for
>   onboarding new developers.
> - Cannot be made transparent; vectorized readers (Hive, Spark) have to
>   implement a similar logic.
>
> Suggestion:
> - Align page row boundaries between different columns, i.e. the n-th page of
>   every column should contain the same number of rows.
> - Filtering logic would be a lot simpler.
> - Vectorized readers will get index-based filtering without any change
>   required on their side.
>
> Response:
> - Ryan doesn't recommend it. Performance numbers?
> - Discuss offline or on the dev mailing list.
> - Timeline for reaching a decision? Within a week. (Gabor already has a
>   working implementation.)
>
> Remove Java-specific code from parquet-format (Nandor)
> ------------------------------------------------------
>
> Background: parquet-format contains a few Java classes. Earlier no changes
> were required in these, but this has changed in recent features, especially
> with the new column encryption feature, which would add substantial new code.
>
> Suggestion (Nandor): Instead of cluttering parquet-format further with
> Java-specific code, move these classes to parquet-mr and deprecate them in
> parquet-format.
>
> What is the motivation behind the status quo? Julien: We may need a different
> Thrift version in the parquet-thrift binding than in the parquet files
> themselves. If we move these classes to parquet-mr, we should shade Thrift.
> Additionally, a thrift compiler is currently only needed for parquet-format,
> not parquet-mr; this will change. Gabor: Dockerization may help.
>
> Julien: We could merge the two repos altogether as well. Gabor: This,
> however, would move the specification into the Java implementation, which
> would be against the cross-language ideology, so let's keep the separate repo
> for the format. Zoltan: Other language bindings should also consider directly
> using it instead of copying parquet.thrift into their source code.
>
> Column encryption (Gidon)
> -------------------------
>
> Under development:
> - Key management API (doesn't provide E2E key management) (PARQUET-1373)
> - Anonymization and data masking (PARQUET-1376)
>
> Java PRs under review:
> - https://github.com/apache/parquet-mr/pull/471
> - https://github.com/apache/parquet-mr/pull/472
>
> C++ PR:
> - https://github.com/apache/parquet-cpp/pull/475
>
> We need more testing (both unit tests and interop tests between Java and C++).
>
> Parquet-format release (Zoltan)
> -------------------------------
>
> Suggested scope (Zoltan):
> - Column encryption
> - Nanosecond precision
> - Anything else?
>
> Discussion:
> - Nothing else to add.
> - Wes welcomes the nano precision; it will be needed in parquet-cpp as well.
>
> C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> --------------------------------------------------
>
> Background: duplicated CI system and codebase, circular dependencies between
> libraries.
>
> Suggestion (Wes): move parquet-cpp into the Arrow codebase. Details can be
> read here:
> https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
>
> Resolution: No objections but no final decision either; vote on the parquet
> list:
> https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
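The alignment proposal in the minutes comes down to simple arithmetic: if every column's n-th page covers the same rows, mapping a row to its page needs no per-column translation. A hypothetical sketch of that property (not code from any Parquet implementation):

```python
def page_boundaries(num_rows, rows_per_page):
    """First row index of each page when boundaries are aligned across columns."""
    return list(range(0, num_rows, rows_per_page))

def page_index_for_row(row, rows_per_page):
    # With aligned boundaries, this index is valid for *every* column,
    # which is what makes index-based filtering transparent to readers.
    return row // rows_per_page

print(page_boundaries(3500, 1000))     # [0, 1000, 2000, 3000]
print(page_index_for_row(2500, 1000))  # 2
```

Without alignment, each column keeps its own boundary list and a row-range filter must be translated per column, which is the complexity the suggestion removes.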
[jira] [Commented] (PARQUET-1389) Improve value skipping at page synchronization
[ https://issues.apache.org/jira/browse/PARQUET-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584107#comment-16584107 ]

ASF GitHub Bot commented on PARQUET-1389:
-----------------------------------------

gszadovszky opened a new pull request #514: PARQUET-1389: Improve value skipping at page synchronization
URL: https://github.com/apache/parquet-mr/pull/514

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Improve value skipping at page synchronization
> ----------------------------------------------
>
>                 Key: PARQUET-1389
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1389
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Minor
>              Labels: pull-request-available
>
> Currently, value skipping is done one-by-one for page synchronization. There
> are encodings (e.g. plain) where several values can be skipped at once.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
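For PLAIN-encoded fixed-width values, the batch skipping the issue describes can be a single offset computation instead of a per-value decode loop. An illustrative sketch (hypothetical function, not the parquet-mr API):

```python
def skip_fixed_width_values(offset, value_width, n):
    """Skip n PLAIN-encoded fixed-width values in one step by advancing
    the read offset, rather than decoding and discarding one by one."""
    return offset + n * value_width

# Skipping 1000 INT64 (8-byte) values is a single arithmetic step:
print(skip_fixed_width_values(0, 8, 1000))  # 8000
```

Variable-width or dictionary/RLE encodings cannot skip quite this directly, which is why the improvement is encoding-specific.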
[jira] [Updated] (PARQUET-1389) Improve value skipping at page synchronization
[ https://issues.apache.org/jira/browse/PARQUET-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-1389:
------------------------------------
    Labels: pull-request-available (was: )
[jira] [Resolved] (PARQUET-1310) Column indexes: Filtering
[ https://issues.apache.org/jira/browse/PARQUET-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1310.
---------------------------------------
    Resolution: Fixed

> Column indexes: Filtering
> -------------------------
>
>                 Key: PARQUET-1310
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1310
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc
[ https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583997#comment-16583997 ] ASF GitHub Bot commented on PARQUET-1384: - cjjnjust closed pull request #488: PARQUET-1384: fix clang build error for bloom_filter-test.cc URL: https://github.com/apache/parquet-cpp/pull/488 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc
index 69583af5..96d2e065 100644
--- a/src/parquet/bloom_filter-test.cc
+++ b/src/parquet/bloom_filter-test.cc
@@ -165,7 +165,7 @@ TEST(CompatibilityTest, TestBloomFilter) {
   std::unique_ptr<uint8_t[]> bitset(new uint8_t[size]());
   std::shared_ptr<Buffer> buffer(new Buffer(bitset.get(), size));
-  handle->Read(size, &buffer);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer));
   InMemoryInputStream source(buffer);
   BlockSplitBloomFilter bloom_filter1 = BlockSplitBloomFilter::Deserialize(&source);
@@ -192,10 +192,10 @@ TEST(CompatibilityTest, TestBloomFilter) {
   bloom_filter2.WriteTo(&sink);
   std::shared_ptr<Buffer> buffer1 = sink.GetBuffer();
-  handle->Seek(0);
-  handle->GetSize();
+  PARQUET_THROW_NOT_OK(handle->Seek(0));
+  PARQUET_THROW_NOT_OK(handle->GetSize());
   std::shared_ptr<Buffer> buffer2;
-  handle->Read(size, &buffer2);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer2));
   EXPECT_TRUE((*buffer1).Equals(*buffer2));
 }

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Clang compiler warnings in bloom_filter-test.cc > - > > Key: PARQUET-1384 > URL: https://issues.apache.org/jira/browse/PARQUET-1384 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Junjie Chen >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > {code} > [69/95] Building CXX object > src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o > ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object > prevents copy elision [-Wpessimizing-move] > BlockSplitBloomFilter de_bloom = > std::move(BlockSplitBloomFilter::Deserialize()); >^ > ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here > BlockSplitBloomFilter de_bloom = > std::move(BlockSplitBloomFilter::Deserialize()); >^~ > ~ > ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object > prevents copy elision [-Wpessimizing-move] > std::move(BlockSplitBloomFilter::Deserialize()); > ^ > ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here > std::move(BlockSplitBloomFilter::Deserialize()); > ^~ ~ > ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Read(size, ); > ^~~~ ~ > ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Seek(0); > ^~~~ ~ > ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->GetSize(); > ^~~ ~ > ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Read(size, ); > {code} -- This message was sent by Atlassian 
JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc
[ https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583996#comment-16583996 ] ASF GitHub Bot commented on PARQUET-1384: - cjjnjust opened a new pull request #490: PARQUET-1384: fix clang build error for bloom_filter-test.cc URL: https://github.com/apache/parquet-cpp/pull/490 replace https://github.com/apache/parquet-cpp/pull/488 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Clang compiler warnings in bloom_filter-test.cc > - > > Key: PARQUET-1384 > URL: https://issues.apache.org/jira/browse/PARQUET-1384 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Junjie Chen >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > {code} > [69/95] Building CXX object > src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o > ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object > prevents copy elision [-Wpessimizing-move] > BlockSplitBloomFilter de_bloom = > std::move(BlockSplitBloomFilter::Deserialize()); >^ > ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here > BlockSplitBloomFilter de_bloom = > std::move(BlockSplitBloomFilter::Deserialize()); >^~ > ~ > ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object > prevents copy elision [-Wpessimizing-move] > std::move(BlockSplitBloomFilter::Deserialize()); > ^ > ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here > std::move(BlockSplitBloomFilter::Deserialize()); > ^~ ~ > ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Read(size, ); > ^~~~ ~ > 
../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Seek(0); > ^~~~ ~ > ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->GetSize(); > ^~~ ~ > ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of > function declared with 'warn_unused_result' attribute [-Wunused-result] > handle->Read(size, ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583976#comment-16583976 ] ASF GitHub Bot commented on PARQUET-1385: - wesm closed pull request #489: PARQUET-1385: Do not run TestBloomFilter.FPPTest when valgrind is in use URL: https://github.com/apache/parquet-cpp/pull/489 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc index dbef8c8b..dfdac12b 100644 --- a/src/parquet/bloom_filter-test.cc +++ b/src/parquet/bloom_filter-test.cc @@ -99,6 +99,11 @@ std::string GetRandomString(uint32_t length) { return ret; } +#ifndef PARQUET_VALGRIND + +// PARQUET-1385(wesm): This test is very slow under valgrind; we omit it in +// test runs for the sake of Travis CI + TEST(FPPTest, TestBloomFilter) { // It counts the number of times FindHash returns true. int exist = 0; @@ -137,6 +142,8 @@ TEST(FPPTest, TestBloomFilter) { EXPECT_TRUE(exist < total_count * fpp); } +#endif // PLASMA_VALGRIND + // The CompatibilityTest is used to test cross compatibility with parquet-mr, it reads // the Bloom filter binary generated by the Bloom filter class in the parquet-mr project // and tests whether the values inserted before could be filtered or not. diff --git a/src/parquet/types.h b/src/parquet/types.h index aec99656..10789cbf 100644 --- a/src/parquet/types.h +++ b/src/parquet/types.h @@ -114,13 +114,9 @@ struct Compression { }; struct Encryption { - enum type { -AES_GCM_V1 = 0, -AES_GCM_CTR_V1 = 1 - }; + enum type { AES_GCM_V1 = 0, AES_GCM_CTR_V1 = 1 }; }; - // parquet::PageType struct PageType { enum type { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 }; This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > This test takes ~5 minutes to run under valgrind in Travis CI > {code} > 1: [==] Running 6 tests from 6 test cases. > 1: [--] Global test environment set-up. > 1: [--] 1 test from Murmur3Test > 1: [ RUN ] Murmur3Test.TestBloomFilter > 1: [ OK ] Murmur3Test.TestBloomFilter (19 ms) > 1: [--] 1 test from Murmur3Test (34 ms total) > 1: > 1: [--] 1 test from ConstructorTest > 1: [ RUN ] ConstructorTest.TestBloomFilter > 1: [ OK ] ConstructorTest.TestBloomFilter (101 ms) > 1: [--] 1 test from ConstructorTest (101 ms total) > 1: > 1: [--] 1 test from BasicTest > 1: [ RUN ] BasicTest.TestBloomFilter > 1: [ OK ] BasicTest.TestBloomFilter (49 ms) > 1: [--] 1 test from BasicTest (49 ms total) > 1: > 1: [--] 1 test from FPPTest > 1: [ RUN ] FPPTest.TestBloomFilter > 1: [ OK ] FPPTest.TestBloomFilter (308731 ms) > 1: [--] 1 test from FPPTest (308741 ms total) > 1: > 1: [--] 1 test from CompatibilityTest > 1: [ RUN ] CompatibilityTest.TestBloomFilter > 1: [ OK ] CompatibilityTest.TestBloomFilter (62 ms) > 1: [--] 1 test from CompatibilityTest (62 ms total) > 1: > 1: [--] 1 test from OptimalValueTest > 1: [ RUN ] OptimalValueTest.TestBloomFilter > 1: [ OK ] OptimalValueTest.TestBloomFilter (27 ms) > 1: [--] 1 test from OptimalValueTest (27 ms total) > 1: > 1: [--] Global test environment tear-down > 1: [==] 6 tests from 6 test cases ran. (309081 ms total) > 1: [ PASSED ] 6 tests. 
> {code} > Either we should change the FPPTest parameters to be faster, or we should not > run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1382) [C++] Prepare for arrow::test namespace removal
[ https://issues.apache.org/jira/browse/PARQUET-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved PARQUET-1382. --- Resolution: Fixed Fix Version/s: cpp-1.5.0 Issue resolved by pull request 487 [https://github.com/apache/parquet-cpp/pull/487] > [C++] Prepare for arrow::test namespace removal > --- > > Key: PARQUET-1382 > URL: https://issues.apache.org/jira/browse/PARQUET-1382 > Project: Parquet > Issue Type: Task > Components: parquet-cpp >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > ARROW-3059 will remove the {{arrow::test}} namespace, make sure the > parquet-cpp codebase doesn't break. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1382) [C++] Prepare for arrow::test namespace removal
[ https://issues.apache.org/jira/browse/PARQUET-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583972#comment-16583972 ] ASF GitHub Bot commented on PARQUET-1382: - wesm closed pull request #487: PARQUET-1382: [C++] Prepare for arrow::test namespace removal URL: https://github.com/apache/parquet-cpp/pull/487 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/.travis.yml b/.travis.yml index 7918b890..e1faf68f 100644 --- a/.travis.yml +++ b/.travis.yml @@ -14,8 +14,15 @@ # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. + sudo: required dist: trusty + +language: cpp + +cache: + ccache: true + addons: apt: sources: @@ -35,6 +42,7 @@ addons: - bison - flex - pkg-config + matrix: fast_finish: true include: @@ -42,10 +50,7 @@ matrix: os: linux before_script: - export PARQUET_CXXFLAGS="-DARROW_NO_DEPRECATED_API" -- source $TRAVIS_BUILD_DIR/ci/before_script_travis.sh - - compiler: gcc -os: linux -before_script: +- export PARQUET_TRAVIS_VALGRIND=1 - source $TRAVIS_BUILD_DIR/ci/before_script_travis.sh - compiler: clang os: linux @@ -76,8 +81,6 @@ matrix: script: - $TRAVIS_BUILD_DIR/ci/travis_script_toolchain.sh -language: cpp - # PARQUET-626: revisit llvm toolchain when/if llvm.org apt repo resurfaces # before_install: diff --git a/ci/before_script_travis.sh b/ci/before_script_travis.sh index 95a2fd82..ce0234c0 100755 --- a/ci/before_script_travis.sh +++ b/ci/before_script_travis.sh @@ -28,15 +28,20 @@ fi export PARQUET_TEST_DATA=$TRAVIS_BUILD_DIR/data +CMAKE_COMMON_FLAGS="-DPARQUET_BUILD_WARNING_LEVEL=CHECKIN" + +if [ $PARQUET_TRAVIS_VALGRIND == "1" ]; then + CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DPARQUET_TEST_MEMCHECK=ON" +fi + if 
[ $TRAVIS_OS_NAME == "linux" ]; then -cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \ - -DPARQUET_TEST_MEMCHECK=ON \ +cmake $CMAKE_COMMON_FLAGS \ + -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \ -DPARQUET_BUILD_BENCHMARKS=ON \ - -DPARQUET_BUILD_WARNING_LEVEL=CHECKIN \ -DPARQUET_GENERATE_COVERAGE=1 \ $TRAVIS_BUILD_DIR else -cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \ - -DPARQUET_BUILD_WARNING_LEVEL=CHECKIN \ +cmake $CMAKE_COMMON_FLAGS \ + -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \ $TRAVIS_BUILD_DIR fi diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index 0136819b..7a50c854 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -45,8 +45,8 @@ if defined need_vcvarsall ( if "%CONFIGURATION%" == "Toolchain" ( conda install -y boost-cpp=1.63 thrift-cpp=0.11.0 ^ - brotli=0.6.0 zlib=1.2.11 snappy=1.1.6 lz4-c=1.7.5 zstd=1.2.0 ^ - -c conda-forge + brotli=1.0.2 zlib=1.2.11 snappy=1.1.7 lz4-c=1.8.0 zstd=1.3.3 ^ + -c conda-forge || exit /B set ARROW_BUILD_TOOLCHAIN=%MINICONDA%/Library set PARQUET_BUILD_TOOLCHAIN=%MINICONDA%/Library diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index d3cef663..30313634 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -33,18 +33,18 @@ make lint # fi if [ $TRAVIS_OS_NAME == "linux" ]; then - make -j4 || exit 1 - ctest -VV -L unittest || { cat $TRAVIS_BUILD_DIR/parquet-build/Testing/Temporary/LastTest.log; exit 1; } + make -j4 + ctest -j2 -VV -L unittest # Current cpp-coveralls version 0.4 throws an error (PARQUET-1075) on Travis CI. 
Pin to last working version sudo pip install cpp_coveralls==0.3.12 export PARQUET_ROOT=$TRAVIS_BUILD_DIR $TRAVIS_BUILD_DIR/ci/upload_coverage.sh else - make -j4 || exit 1 + make -j4 BUILD_TYPE=debug EXECUTABLE_DIR=$CPP_BUILD_DIR/$BUILD_TYPE export LD_LIBRARY_PATH=$EXECUTABLE_DIR:$LD_LIBRARY_PATH - ctest -VV -L unittest || { cat $TRAVIS_BUILD_DIR/parquet-build/Testing/Temporary/LastTest.log; exit 1; } + ctest -j2 -VV -L unittest fi popd diff --git a/ci/travis_script_static.sh b/ci/travis_script_static.sh index b76ced8f..8af574e3 100755 --- a/ci/travis_script_static.sh +++ b/ci/travis_script_static.sh @@ -65,8 +65,14 @@ export ZLIB_STATIC_LIB=$ARROW_EP/zlib_ep/src/zlib_ep-install/lib/libz.a export LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a export ZSTD_STATIC_LIB=$ARROW_EP/zstd_ep-prefix/src/zstd_ep/lib/libzstd.a -cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \ - -DPARQUET_TEST_MEMCHECK=ON \ +CMAKE_COMMON_FLAGS="-DPARQUET_BUILD_WARNING_LEVEL=CHECKIN" + +if [ $PARQUET_TRAVIS_VALGRIND == "1" ]; then + CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DPARQUET_TEST_MEMCHECK=ON" +fi + +cmake
[jira] [Created] (PARQUET-1389) Improve value skipping at page synchronization
Gabor Szadovszky created PARQUET-1389: - Summary: Improve value skipping at page synchronization Key: PARQUET-1389 URL: https://issues.apache.org/jira/browse/PARQUET-1389 Project: Parquet Issue Type: Sub-task Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky Currently, value skipping is done one-by-one for page synchronization. There are encodings (e.g. plain) where several values can be skipped at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Parquet sync meeting minutes
Topics discussed and decisions (meeting held on 2018 August 15th, at 6pm CET / 9am PST):
- Aligning page row boundaries between different columns: debated, please follow up
- Remove Java-specific code from parquet-format: accepted
- Column encryption: please review
- Parquet-format release: scope accepted
- C++ mono-repo: please vote

Aligning page row boundaries between different columns (Gabor)
--
Background: In the existing specification of column indexes, page boundaries are not aligned between different columns with respect to row count.
Gabor: implemented this logic; interested parties can review the code here:
- https://github.com/apache/parquet-mr/pull/509
- https://github.com/apache/parquet-mr/commits/column-indexes
Main takeaways from the implementation:
- The index filtering logic as currently specified is overcomplicated.
- It may become a maintenance burden and results in a steep learning curve for onboarding new developers.
- It cannot be made transparent; vectorized readers (Hive, Spark) have to implement similar logic.
Suggestion:
- Align page row boundaries between different columns, i.e. the n-th page of every column should contain the same number of rows.
- The filtering logic would be a lot simpler.
- Vectorized readers will get index-based filtering without any change required on their side.
Response:
- Ryan doesn't recommend it. Performance numbers?
- Discuss offline or on the dev mailing list.
- Timeline for reaching a decision? Within a week. (Gabor already has a working implementation.)

Remove Java-specific code from parquet-format (Nandor)
--
Background: parquet-format contains a few Java classes. Earlier no changes were required in these, but this has changed with recent features, especially the new column encryption feature, which would add substantial new code.
Suggestion (Nandor): Instead of cluttering parquet-format further with Java-specific code, move these classes to parquet-mr and deprecate them in parquet-format.
What is the motivation behind the status quo?
Julien: We may need a different Thrift version in the parquet-thrift binding than in the parquet files themselves. If we move these classes to parquet-mr, we should shade Thrift. Additionally, a thrift compiler is currently only needed for parquet-format, not for parquet-mr; this would change.
Gabor: Dockerization may help.
Julien: We could also merge the two repos altogether.
Gabor: That, however, would move the specification into the Java implementation, which would be against the cross-language ideology, so let's keep the separate repo for the format.
Zoltan: Other language bindings should also consider using it directly instead of copying parquet.thrift into their source code.

Column encryption (Gidon)
-
Under development:
- Key management API (doesn't provide E2E key management) (PARQUET-1373)
- Anonymization and data masking (PARQUET-1376)
Java PRs under review:
- https://github.com/apache/parquet-mr/pull/471
- https://github.com/apache/parquet-mr/pull/472
C++ PR:
- https://github.com/apache/parquet-cpp/pull/475
We need more testing (both unit tests and interop tests between Java and C++).

Parquet-format release (Zoltan)
---
Suggested scope (Zoltan):
- Column encryption
- Nanosecond precision
- Anything else?
Discussion:
- Nothing else to add.
- Wes welcomes the nanosecond precision; it will be needed in parquet-cpp as well.

C++ mono-repo: merging Arrow and parquet-cpp (Wes)
--
Background: duplicated CI systems and codebases, circular dependencies between libraries.
Suggestion (Wes): move parquet-cpp into the Arrow codebase. Details can be read here: https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
Resolution: No objections but no final decision either; vote on the parquet list: https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E
[jira] [Commented] (PARQUET-1383) Parquet tools should print logical type instead of (or besides) original type
[ https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583826#comment-16583826 ] ASF GitHub Bot commented on PARQUET-1383: - nandorKollar opened a new pull request #513: PARQUET-1383: Parquet tools should print logical type instead of (or besides) original type URL: https://github.com/apache/parquet-mr/pull/513 This pull request addresses two topics: - write the logical type in parquet-tools meta besides the original type - take the UTC-normalized parameter into account when printing time/timestamp values (using stringifiers) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Parquet tools should print logical type instead of (or besides) original type > - > > Key: PARQUET-1383 > URL: https://issues.apache.org/jira/browse/PARQUET-1383 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Minor > Labels: pull-request-available > > Currently, parquet-tools prints the original type. Since the new logical > type API has been introduced, it would be better to print the logical type instead of, > or besides, the original type. > Also, the values written by the tools should take the UTC-normalized parameter > into account. Right now, every time and timestamp value is adjusted to UTC > when printed via parquet-tools -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1383) Parquet tools should print logical type instead of (or besides) original type
[ https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1383: Labels: pull-request-available (was: ) > Parquet tools should print logical type instead of (or besides) original type > - > > Key: PARQUET-1383 > URL: https://issues.apache.org/jira/browse/PARQUET-1383 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Minor > Labels: pull-request-available > > Currently, parquet-tools prints the original type. Since the new logical > type API has been introduced, it would be better to print the logical type instead of, > or besides, the original type. > Also, the values written by the tools should take the UTC-normalized parameter > into account. Right now, every time and timestamp value is adjusted to UTC > when printed via parquet-tools -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583819#comment-16583819 ] ASF GitHub Bot commented on PARQUET-1387: - nandorKollar opened a new pull request #102: PARQUET-1387: Nanosecond precision time and timestamp - parquet-format URL: https://github.com/apache/parquet-format/pull/102 Introduce new nanosecond precision in TimeUnit This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Nanosecond precision time and timestamp - parquet-format > > > Key: PARQUET-1387 > URL: https://issues.apache.org/jira/browse/PARQUET-1387 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1387: Labels: pull-request-available (was: ) > Nanosecond precision time and timestamp - parquet-format > > > Key: PARQUET-1387 > URL: https://issues.apache.org/jira/browse/PARQUET-1387 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr
Nandor Kollar created PARQUET-1388: -- Summary: Nanosecond precision time and timestamp - parquet-mr Key: PARQUET-1388 URL: https://issues.apache.org/jira/browse/PARQUET-1388 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Nandor Kollar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nandor Kollar updated PARQUET-1387: --- Fix Version/s: (was: format-2.6.0) > Nanosecond precision time and timestamp - parquet-format > > > Key: PARQUET-1387 > URL: https://issues.apache.org/jira/browse/PARQUET-1387 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nandor Kollar updated PARQUET-1387: --- Fix Version/s: format-2.6.0 > Nanosecond precision time and timestamp - parquet-format > > > Key: PARQUET-1387 > URL: https://issues.apache.org/jira/browse/PARQUET-1387 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Fix For: format-2.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format
Nandor Kollar created PARQUET-1387: -- Summary: Nanosecond precision time and timestamp - parquet-format Key: PARQUET-1387 URL: https://issues.apache.org/jira/browse/PARQUET-1387 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Nandor Kollar Assignee: Nandor Kollar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1386) Fix issues of NaN and +-0.0 in case of float/double column indexes
Gabor Szadovszky created PARQUET-1386: - Summary: Fix issues of NaN and +-0.0 in case of float/double column indexes Key: PARQUET-1386 URL: https://issues.apache.org/jira/browse/PARQUET-1386 Project: Parquet Issue Type: Sub-task Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky Workaround the float/double column indexes just like we did for statistics in PARQUET-1246. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583489#comment-16583489 ] Junjie Chen commented on PARQUET-1385: -- According to perf, std::seed_seq::generate takes more than 75% of the CPU cycles. We could switch to using the system clock as the seed, which reduces the runtime to about 1/5 (on my machine). That said, skipping this test under valgrind is also fine. > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > This test takes ~5 minutes to run under valgrind in Travis CI > {code} > 1: [==] Running 6 tests from 6 test cases. > 1: [--] Global test environment set-up. > 1: [--] 1 test from Murmur3Test > 1: [ RUN ] Murmur3Test.TestBloomFilter > 1: [ OK ] Murmur3Test.TestBloomFilter (19 ms) > 1: [--] 1 test from Murmur3Test (34 ms total) > 1: > 1: [--] 1 test from ConstructorTest > 1: [ RUN ] ConstructorTest.TestBloomFilter > 1: [ OK ] ConstructorTest.TestBloomFilter (101 ms) > 1: [--] 1 test from ConstructorTest (101 ms total) > 1: > 1: [--] 1 test from BasicTest > 1: [ RUN ] BasicTest.TestBloomFilter > 1: [ OK ] BasicTest.TestBloomFilter (49 ms) > 1: [--] 1 test from BasicTest (49 ms total) > 1: > 1: [--] 1 test from FPPTest > 1: [ RUN ] FPPTest.TestBloomFilter > 1: [ OK ] FPPTest.TestBloomFilter (308731 ms) > 1: [--] 1 test from FPPTest (308741 ms total) > 1: > 1: [--] 1 test from CompatibilityTest > 1: [ RUN ] CompatibilityTest.TestBloomFilter > 1: [ OK ] CompatibilityTest.TestBloomFilter (62 ms) > 1: [--] 1 test from CompatibilityTest (62 ms total) > 1: > 1: [--] 1 test from OptimalValueTest > 1: [ RUN ] OptimalValueTest.TestBloomFilter > 1: [ OK ] OptimalValueTest.TestBloomFilter (27 ms) > 1: [--] 1 test from OptimalValueTest (27 ms total) > 1: > 1: [--] 
Global test environment tear-down > 1: [==] 6 tests from 6 test cases ran. (309081 ms total) > 1: [ PASSED ] 6 tests. > {code} > Either we should change the FPPTest parameters to be faster, or we should not > run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor
[ https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved PARQUET-1308. --- Resolution: Fixed Fix Version/s: cpp-1.5.0 Issue resolved by pull request 467 [https://github.com/apache/parquet-cpp/pull/467] > [C++] parquet::arrow should use thread pool, not ParallelFor > > > Key: PARQUET-1308 > URL: https://issues.apache.org/jira/browse/PARQUET-1308 > Project: Parquet > Issue Type: Task > Components: parquet-cpp >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > Arrow now has a global thread pool, parquet::arrow should use that instead of > ParallelFor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor
[ https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583431#comment-16583431 ] ASF GitHub Bot commented on PARQUET-1308: - wesm closed pull request #467: PARQUET-1308: [C++] Use Arrow thread pool, not Arrow ParallelFor, fix deprecated APIs, upgrade clang-format version. Fix record delimiting bug URL: https://github.com/apache/parquet-cpp/pull/467 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:
diff --git a/benchmarks/decode_benchmark.cc b/benchmarks/decode_benchmark.cc
index 8f2dfa07..3ae32b4c 100644
--- a/benchmarks/decode_benchmark.cc
+++ b/benchmarks/decode_benchmark.cc
@@ -42,8 +42,8 @@ class DeltaBitPackEncoder {
   uint8_t* Encode(int* encoded_len) {
     uint8_t* result = new uint8_t[10 * 1024 * 1024];
-    int num_mini_blocks = static_cast(arrow::BitUtil::Ceil(num_values() - 1,
-        mini_block_size_));
+    int num_mini_blocks = static_cast(arrow::BitUtil::CeilDiv(num_values() - 1,
+        mini_block_size_));
     uint8_t* mini_block_widths = NULL;
     arrow::BitWriter writer(result, 10 * 1024 * 1024);
diff --git a/cmake_modules/ArrowExternalProject.cmake b/cmake_modules/ArrowExternalProject.cmake
index 4f23661e..3d1a2760 100644
--- a/cmake_modules/ArrowExternalProject.cmake
+++ b/cmake_modules/ArrowExternalProject.cmake
@@ -46,7 +46,7 @@ if (MSVC AND PARQUET_USE_STATIC_CRT)
 endif()
 if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
-  set(ARROW_VERSION "501d60e918bd4d10c429ab34e0b8e8a87dffb732")
+  set(ARROW_VERSION "3edfd7caf2746eeba37d5ac7bfd3665cc159e7ad")
 else()
   set(ARROW_VERSION "$ENV{PARQUET_ARROW_VERSION}")
 endif()
diff --git a/cmake_modules/FindClangTools.cmake b/cmake_modules/FindClangTools.cmake
index 215a5cd9..56e2dd77 100644
--- a/cmake_modules/FindClangTools.cmake
+++ b/cmake_modules/FindClangTools.cmake
@@ -96,7 +96,9 @@ if (CLANG_FORMAT_VERSION)
   endif()
 else()
   find_program(CLANG_FORMAT_BIN
-    NAMES clang-format-4.0
+    NAMES clang-format-6.0
+          clang-format-5.0
+          clang-format-4.0
           clang-format-3.9
           clang-format-3.8
           clang-format-3.7
diff --git a/cmake_modules/SetupCxxFlags.cmake b/cmake_modules/SetupCxxFlags.cmake
index 01ed85bf..5ca3f4ef 100644
--- a/cmake_modules/SetupCxxFlags.cmake
+++ b/cmake_modules/SetupCxxFlags.cmake
@@ -84,6 +84,7 @@ if ("${UPPERCASE_BUILD_WARNING_LEVEL}" STREQUAL "CHECKIN")
   -Wno-shadow -Wno-switch-enum -Wno-exit-time-destructors \
   -Wno-global-constructors -Wno-weak-template-vtables -Wno-undefined-reinterpret-cast \
   -Wno-implicit-fallthrough -Wno-unreachable-code-return \
+  -Wno-documentation-deprecated-sync \
   -Wno-float-equal -Wno-missing-prototypes \
   -Wno-old-style-cast -Wno-covered-switch-default \
   -Wno-format-nonliteral -Wno-missing-noreturn \
diff --git a/src/parquet/arrow/arrow-reader-writer-benchmark.cc b/src/parquet/arrow/arrow-reader-writer-benchmark.cc
index 15d2cf72..51eb0c23 100644
--- a/src/parquet/arrow/arrow-reader-writer-benchmark.cc
+++ b/src/parquet/arrow/arrow-reader-writer-benchmark.cc
@@ -104,9 +104,9 @@ std::shared_ptr<::arrow::Table> TableFromVector(
   std::vector valid_bytes(BENCHMARK_SIZE, 0);
   int n = {0};
   std::generate(valid_bytes.begin(), valid_bytes.end(), [] { return n++ % 2; });
-  EXIT_NOT_OK(builder.Append(vec.data(), vec.size(), valid_bytes.data()));
+  EXIT_NOT_OK(builder.AppendValues(vec.data(), vec.size(), valid_bytes.data()));
 } else {
-  EXIT_NOT_OK(builder.Append(vec.data(), vec.size(), nullptr));
+  EXIT_NOT_OK(builder.AppendValues(vec.data(), vec.size(), nullptr));
 }
 std::shared_ptr<::arrow::Array> array;
 EXIT_NOT_OK(builder.Finish());
@@ -126,9 +126,9 @@ std::shared_ptr<::arrow::Table> TableFromVector(const std::vector array; EXIT_NOT_OK(builder.Finish());
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index d4f5b000..be3e6114 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -320,8 +320,7 @@ using ParquetDataType = DataType::parquet_enum>;
 template
 using ParquetWriter = TypedColumnWriter>;
-void WriteTableToBuffer(const std::shared_ptr& table, int num_threads,
-                        int64_t row_group_size,
+void WriteTableToBuffer(const std::shared_ptr& table, int64_t row_group_size,
                         const std::shared_ptr& arrow_properties,
                         std::shared_ptr* out) {
   auto sink = std::make_shared();
@@ -399,21 +398,21 @@ void AssertTablesEqual(const Table& expected, const Table&
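The Ceil → CeilDiv change in the first hunk is a rename of an ordinary ceiling-division helper, used there to count how many delta-encoding mini-blocks are needed for a run of values. A standalone sketch of the computation (illustrative only; the real code calls arrow::BitUtil::CeilDiv):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division: the number of blocks of size `block_size` needed to
// hold `n` values. Mirrors what arrow::BitUtil::CeilDiv computes.
int64_t CeilDiv(int64_t n, int64_t block_size) {
  return (n + block_size - 1) / block_size;
}
```

The `(n + b - 1) / b` form avoids floating point and stays exact for the sizes involved.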
[jira] [Updated] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor
[ https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1308: Labels: pull-request-available (was: ) > [C++] parquet::arrow should use thread pool, not ParallelFor > > > Key: PARQUET-1308 > URL: https://issues.apache.org/jira/browse/PARQUET-1308 > Project: Parquet > Issue Type: Task > Components: parquet-cpp >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > Arrow now has a global thread pool, parquet::arrow should use that instead of > ParallelFor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
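The point of PARQUET-1308 is to submit work to one long-lived executor instead of spinning up fresh threads on every parallel column read, which is what the old ParallelFor helper did. A minimal sketch of the pattern, with std::async standing in for Arrow's global thread pool (the actual parquet::arrow code uses Arrow's internal ThreadPool; the names below are illustrative, not the real API):

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Apply `fn` to each input element concurrently and collect the results in
// order. In the real code the tasks would go to a shared, process-wide
// thread pool so thread-creation cost is paid once, not per call.
std::vector<int> ParallelMap(const std::vector<int>& input,
                             const std::function<int(int)>& fn) {
  std::vector<std::future<int>> futures;
  futures.reserve(input.size());
  for (int v : input) {
    futures.push_back(std::async(std::launch::async, fn, v));
  }
  std::vector<int> out;
  out.reserve(futures.size());
  for (auto& f : futures) out.push_back(f.get());
  return out;
}
```

Collecting futures in submission order keeps the output deterministic even though the tasks themselves finish in any order.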
[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583425#comment-16583425 ] Junjie Chen commented on PARQUET-1385: -- The GetRandomString function is very slow; I can cut the test count to 1/10 of its current size, or spend some time optimizing it. > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > This test takes ~5 minutes to run under valgrind in Travis CI > {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN ] Murmur3Test.TestBloomFilter
> 1: [ OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN ] ConstructorTest.TestBloomFilter
> 1: [ OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: [--] 1 test from BasicTest
> 1: [ RUN ] BasicTest.TestBloomFilter
> 1: [ OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: [--] 1 test from FPPTest
> 1: [ RUN ] FPPTest.TestBloomFilter
> 1: [ OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN ] CompatibilityTest.TestBloomFilter
> 1: [ OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN ] OptimalValueTest.TestBloomFilter
> 1: [ OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [ PASSED ] 6 tests.
> {code} > Either we should change the FPPTest parameters to be faster, or we should not run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
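The eventual fix skips the expensive FPPTest when the build runs under valgrind. One straightforward way to express such a gate is a flag that only the valgrind CI entry sets; the PARQUET_VALGRIND environment variable below is hypothetical, since the actual PR wires the skip through the build/CI configuration:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical guard: return true when the environment says we are running
// under an expensive instrumentation tool (e.g. the valgrind Travis entry),
// so multi-minute statistical tests can be skipped there.
bool SkipExpensiveTests() {
  return std::getenv("PARQUET_VALGRIND") != nullptr;
}
```

The test would then early-return (or, with a newer googletest, call GTEST_SKIP()) when the guard fires, while all other CI entries still run it.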
[jira] [Updated] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc
[ https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1384: Labels: pull-request-available (was: ) > [C++] Clang compiler warnings in bloom_filter-test.cc > --- > > Key: PARQUET-1384 > URL: https://issues.apache.org/jira/browse/PARQUET-1384 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Junjie Chen > Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > {code}
> [69/95] Building CXX object src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
> BlockSplitBloomFilter de_bloom = std::move(BlockSplitBloomFilter::Deserialize());
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
> std::move(BlockSplitBloomFilter::Deserialize());
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
> handle->Read(size, );
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
> handle->Seek(0);
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
> handle->GetSize();
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
> handle->Read(size, );
> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
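The -Wpessimizing-move warnings above are fixed by deleting the std::move around a call that already returns by value: copy elision applies there, and wrapping the temporary in std::move actively suppresses it. (The -Wunused-result warnings are fixed separately by checking the returned Status.) A toy reproduction, where BloomFilterToy and Deserialize are stand-ins rather than the real parquet-cpp types:

```cpp
#include <cassert>

// Stand-in for BlockSplitBloomFilter: a value type returned by value.
struct BloomFilterToy {
  int num_bytes;
};

// Stand-in for BlockSplitBloomFilter::Deserialize().
BloomFilterToy Deserialize() { return BloomFilterToy{32}; }

BloomFilterToy MakeFilter() {
  // Triggers -Wpessimizing-move and blocks copy elision:
  //   BloomFilterToy f = std::move(Deserialize());
  // The fix is simply to drop the std::move:
  BloomFilterToy f = Deserialize();
  return f;  // NRVO / implicit move applies on return as well
}
```

With the std::move removed, the compiler constructs the result in place instead of materializing a temporary and then moving from it.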
[jira] [Updated] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1385: Labels: pull-request-available (was: ) > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > This test takes ~5 minutes to run under valgrind in Travis CI > Either we should change the FPPTest parameters to be faster, or we should not run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583419#comment-16583419 ] ASF GitHub Bot commented on PARQUET-1385: - wesm opened a new pull request #489: PARQUET-1385: Do not run TestBloomFilter.FPPTest when valgrind is in use URL: https://github.com/apache/parquet-cpp/pull/489 This test will still be run in the other Travis CI entries where valgrind is not in use. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 > > This test takes ~5 minutes to run under valgrind in Travis CI > Either we should change the FPPTest parameters to be faster, or we should not run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc
[ https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583417#comment-16583417 ] ASF GitHub Bot commented on PARQUET-1384: - cjjnjust opened a new pull request #488: PARQUET-1384: fix clang build error for bloom_filter-test.cc URL: https://github.com/apache/parquet-cpp/pull/488 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Clang compiler warnings in bloom_filter-test.cc > --- > > Key: PARQUET-1384 > URL: https://issues.apache.org/jira/browse/PARQUET-1384 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Junjie Chen > Priority: Major > Labels: pull-request-available > Fix For: cpp-1.5.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
[ https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned PARQUET-1385: - Assignee: Wes McKinney > [C++] bloom_filter-test is very slow under valgrind > --- > > Key: PARQUET-1385 > URL: https://issues.apache.org/jira/browse/PARQUET-1385 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Reporter: Wes McKinney > Assignee: Wes McKinney > Priority: Major > Fix For: cpp-1.5.0 > > This test takes ~5 minutes to run under valgrind in Travis CI > Either we should change the FPPTest parameters to be faster, or we should not run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind
Wes McKinney created PARQUET-1385: - Summary: [C++] bloom_filter-test is very slow under valgrind Key: PARQUET-1385 URL: https://issues.apache.org/jira/browse/PARQUET-1385 Project: Parquet Issue Type: Bug Components: parquet-cpp Reporter: Wes McKinney Fix For: cpp-1.5.0 This test takes ~5 minutes to run under valgrind in Travis CI. Either we should change the FPPTest parameters to be faster, or we should not run that test when using valgrind -- This message was sent by Atlassian JIRA (v7.6.3#76005)
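For context on what the slow FPPTest measures: the textbook false-positive probability of a Bloom filter with m bits, k hash functions, and n inserted keys is p ≈ (1 − e^(−kn/m))^k, and the bit count needed for a target p is m = −n·ln p / (ln 2)². A sketch of those formulas (illustrative only; parquet-cpp's BlockSplitBloomFilter uses its own block-based layout, so these are the generic expressions, not its exact sizing code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bits needed for `ndv` distinct values at target false-positive prob `fpp`:
// m = -n * ln(p) / (ln 2)^2.
uint64_t OptimalNumBits(uint64_t ndv, double fpp) {
  double bits = -static_cast<double>(ndv) * std::log(fpp) /
                (std::log(2.0) * std::log(2.0));
  return static_cast<uint64_t>(std::ceil(bits));
}

// Expected false-positive probability for m bits, k hashes, n keys:
// p = (1 - e^(-k*n/m))^k.
double ExpectedFpp(uint64_t m, uint64_t k, uint64_t n) {
  double exponent =
      -static_cast<double>(k) * static_cast<double>(n) / static_cast<double>(m);
  return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}
```

For a 1% target, this works out to roughly 9.6 bits per distinct value, which is why an FPP test has to insert and probe a large number of keys to get a statistically meaningful measurement, and why it is so slow under valgrind.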
[jira] [Updated] (PARQUET-1380) [C++] move Bloom filter test binary to parquet-testing repo
[ https://issues.apache.org/jira/browse/PARQUET-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-1380: -- Summary: [C++] move Bloom filter test binary to parquet-testing repo (was: move Bloom filter test binary to parquet-testing repo) > [C++] move Bloom filter test binary to parquet-testing repo > --- > > Key: PARQUET-1380 > URL: https://issues.apache.org/jira/browse/PARQUET-1380 > Project: Parquet > Issue Type: Sub-task > Components: parquet-cpp >Reporter: Junjie Chen >Assignee: Junjie Chen >Priority: Minor > Fix For: cpp-1.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)