[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584436#comment-16584436
 ] 

ASF GitHub Bot commented on PARQUET-1369:
-

rgruener opened a new pull request #491: PARQUET-1369: Disregard column sort 
order if statistics max/min are equal
URL: https://github.com/apache/parquet-cpp/pull/491
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for an example 
> file.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I raised this here and not in 
> JIRA since I wanted to be sure this is actually an issue and there wasn't a 
> ticket already made there (I couldn't find one, but I wanted to be sure). 
> Either way I would like to understand why this is.
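
The linked pull request title points at the likely cause: min/max statistics are only exposed when the column's sort order is unambiguous, and the proposed fix is to disregard the sort order when max and min are equal, since ordering cannot matter for a single value. A minimal C++ sketch of that idea (hypothetical helper, not the actual parquet-cpp change):

{code}
// Illustrative sketch only: when the encoded min equals the encoded max, the
// sort order cannot affect the statistics, so they are safe to expose even for
// types whose ordering would otherwise be considered ambiguous.
#include <string>
#include "parquet/types.h"

bool StatisticsUsable(parquet::SortOrder::type sort_order,
                      const std::string& encoded_min, const std::string& encoded_max) {
  if (sort_order == parquet::SortOrder::SIGNED) {
    return true;  // ordering is well defined for this type
  }
  return encoded_min == encoded_max;  // single value: order is irrelevant
}
{code}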



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1369:

Labels: parquet pull-request-available  (was: parquet)

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows statistics for the 
> string column when using the java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for an example 
> file.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I raised this here and not in 
> JIRA since I wanted to be sure this is actually an issue and there wasn't a 
> ticket already made there (I couldn't find one, but I wanted to be sure). 
> Either way I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1256:
-

Assignee: Jacek Pliszka

> [C++] Add --print-key-value-metadata option to parquet_reader tool
> --
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jacek Pliszka
>Assignee: Jacek Pliszka
>Priority: Trivial
>  Labels: patch, pull-request-available
> Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1256.
---
Resolution: Fixed

Issue resolved by pull request 450
[https://github.com/apache/parquet-cpp/pull/450]

> [C++] Add --print-key-value-metadata option to parquet_reader tool
> --
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jacek Pliszka
>Priority: Trivial
>  Labels: patch, pull-request-available
> Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450
>  
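
For reference, a minimal sketch of how key/value metadata can be listed through the parquet-cpp FileMetaData API used by the tool (the calls mirror the merged diff quoted further below in this digest; the surrounding function is hypothetical):

{code}
// Sketch only: list the file-level key/value metadata with parquet-cpp.
#include <iostream>
#include "parquet/api/reader.h"

void PrintKeyValueMetadata(parquet::ParquetFileReader* reader) {
  const parquet::FileMetaData* file_metadata = reader->metadata().get();
  auto key_value_metadata = file_metadata->key_value_metadata();
  if (key_value_metadata == nullptr) return;
  for (int64_t i = 0; i < key_value_metadata->size(); i++) {
    std::cout << key_value_metadata->key(i) << ": "
              << key_value_metadata->value(i) << "\n";
  }
}
{code}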



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584429#comment-16584429
 ] 

ASF GitHub Bot commented on PARQUET-1256:
-

wesm closed pull request #450: PARQUET-1256: Add --print-key-value-metadata 
option to parquet_reader tool
URL: https://github.com/apache/parquet-cpp/pull/450
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/printer.cc b/src/parquet/printer.cc
index 3f18a5c8..9f26a418 100644
--- a/src/parquet/printer.cc
+++ b/src/parquet/printer.cc
@@ -33,13 +33,25 @@ namespace parquet {
 #define COL_WIDTH "30"
 
 void ParquetFilePrinter::DebugPrint(std::ostream& stream, std::list<int> selected_columns,
-                                    bool print_values, const char* filename) {
+                                    bool print_values, bool print_key_value_metadata,
+                                    const char* filename) {
   const FileMetaData* file_metadata = fileReader->metadata().get();
 
   stream << "File Name: " << filename << "\n";
   stream << "Version: " << file_metadata->version() << "\n";
   stream << "Created By: " << file_metadata->created_by() << "\n";
   stream << "Total rows: " << file_metadata->num_rows() << "\n";
+
+  if (print_key_value_metadata) {
+    auto key_value_metadata = file_metadata->key_value_metadata();
+    int64_t size_of_key_value_metadata = key_value_metadata->size();
+    stream << "Key Value File Metadata: " << size_of_key_value_metadata << " entries\n";
+    for (int64_t i = 0; i < size_of_key_value_metadata; i++) {
+      stream << " Key nr " << i << " " << key_value_metadata->key(i) << ": "
+             << key_value_metadata->value(i) << "\n";
+    }
+  }
+
   stream << "Number of RowGroups: " << file_metadata->num_row_groups() << "\n";
   stream << "Number of Real Columns: "
  << file_metadata->schema()->group_node()->field_count() << "\n";
diff --git a/src/parquet/printer.h b/src/parquet/printer.h
index 3b828829..1113c3fe 100644
--- a/src/parquet/printer.h
+++ b/src/parquet/printer.h
@@ -38,7 +38,8 @@ class PARQUET_EXPORT ParquetFilePrinter {
   ~ParquetFilePrinter() {}
 
   void DebugPrint(std::ostream& stream, std::list<int> selected_columns,
-                  bool print_values = true, const char* fileame = "No Name");
+                  bool print_values = true, bool print_key_value_metadata = false,
+                  const char* filename = "No Name");
 
   void JSONPrint(std::ostream& stream, std::list<int> selected_columns,
                  const char* filename = "No Name");
diff --git a/tools/parquet_reader.cc b/tools/parquet_reader.cc
index 7ef59dc1..34bdfc10 100644
--- a/tools/parquet_reader.cc
+++ b/tools/parquet_reader.cc
@@ -24,13 +24,14 @@
 int main(int argc, char** argv) {
   if (argc > 5 || argc < 2) {
     std::cerr << "Usage: parquet_reader [--only-metadata] [--no-memory-map] [--json]"
-                  "[--columns=...] <file>"
+                  "[--print-key-value-metadata] [--columns=...] <file>"
               << std::endl;
     return -1;
   }
 
   std::string filename;
   bool print_values = true;
+  bool print_key_value_metadata = false;
   bool memory_map = true;
   bool format_json = false;
 
@@ -42,6 +43,8 @@ int main(int argc, char** argv) {
   for (int i = 1; i < argc; i++) {
 if ((param = std::strstr(argv[i], "--only-metadata"))) {
   print_values = false;
+} else if ((param = std::strstr(argv[i], "--print-key-value-metadata"))) {
+  print_key_value_metadata = true;
 } else if ((param = std::strstr(argv[i], "--no-memory-map"))) {
   memory_map = false;
 } else if ((param = std::strstr(argv[i], "--json"))) {
@@ -64,7 +67,8 @@ int main(int argc, char** argv) {
 if (format_json) {
   printer.JSONPrint(std::cout, columns, filename.c_str());
 } else {
-  printer.DebugPrint(std::cout, columns, print_values, filename.c_str());
+  printer.DebugPrint(std::cout, columns, print_values,
+print_key_value_metadata, filename.c_str());
 }
   } catch (const std::exception& e) {
 std::cerr << "Parquet error: " << e.what() << std::endl;


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Add --print-key-value-metadata option to parquet_reader tool
> --
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: 

[jira] [Updated] (PARQUET-1256) [C++] Add --print-key-value-metadata option to parquet_reader tool

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1256:

Labels: patch pull-request-available  (was: patch)

> [C++] Add --print-key-value-metadata option to parquet_reader tool
> --
>
> Key: PARQUET-1256
> URL: https://issues.apache.org/jira/browse/PARQUET-1256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jacek Pliszka
>Priority: Trivial
>  Labels: patch, pull-request-available
> Fix For: cpp-1.5.0
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Added --print-key-value-metadata option to parquet_reader tool
> https://github.com/apache/parquet-cpp/pull/450
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1370) [C++] Read consecutive column chunks in a single scan

2018-08-17 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584397#comment-16584397
 ] 

Robert Gruener edited comment on PARQUET-1370 at 8/17/18 9:20 PM:
--

That seems to only be the case for python3. Do the pyarrow file handles not 
implement RawIOBase in python2 as well? As far as I can tell the code does not 
suggest that though those have been my results.


was (Author: rgruener):
That seems to only be the case for python3. Do the pyarrow file handles no 
implement RawIOBase in python2 as well? As far as I can tell the code does not 
suggest that though those have been my results.

> [C++] Read consecutive column chunks in a single scan
> -
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page, 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The java implementation already handles this and will read consecutive 
> column chunks (and the resulting pages) in a single scan, see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.
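
To illustrate the coalescing idea being requested, here is a sketch under the stated assumption that consecutive column chunks are laid out back to back in the file; it is not the parquet-cpp implementation:

{code}
// Illustrative sketch: merge byte ranges of consecutive column chunks so they
// can be fetched with one filesystem scan instead of one scan per chunk/page.
#include <cstdint>
#include <vector>

struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Assumes `chunks` is sorted by file offset, e.g. the column chunks of a row group.
std::vector<ByteRange> CoalesceConsecutive(const std::vector<ByteRange>& chunks) {
  std::vector<ByteRange> merged;
  for (const ByteRange& c : chunks) {
    if (!merged.empty() &&
        merged.back().offset + merged.back().length == c.offset) {
      merged.back().length += c.length;  // contiguous with previous range: extend the scan
    } else {
      merged.push_back(c);  // gap: start a new scan
    }
  }
  return merged;
}
{code}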



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1370) [C++] Read consecutive column chunks in a single scan

2018-08-17 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584397#comment-16584397
 ] 

Robert Gruener commented on PARQUET-1370:
-

That seems to only be the case for python3. Do the pyarrow file handles no 
implement RawIOBase in python2 as well? As far as I can tell the code does not 
suggest that though those have been my results.

> [C++] Read consecutive column chunks in a single scan
> -
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page, 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The java implementation already handles this and will read consecutive 
> column chunks (and the resulting pages) in a single scan, see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584161#comment-16584161
 ] 

ASF GitHub Bot commented on PARQUET-1384:
-

wesm closed pull request #490: PARQUET-1384: fix clang build error for 
bloom_filter-test.cc
URL: https://github.com/apache/parquet-cpp/pull/490
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc
index 69583af5..96d2e065 100644
--- a/src/parquet/bloom_filter-test.cc
+++ b/src/parquet/bloom_filter-test.cc
@@ -165,7 +165,7 @@ TEST(CompatibilityTest, TestBloomFilter) {
 
   std::unique_ptr<uint8_t[]> bitset(new uint8_t[size]());
   std::shared_ptr<Buffer> buffer(new Buffer(bitset.get(), size));
-  handle->Read(size, &buffer);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer));
 
   InMemoryInputStream source(buffer);
   BlockSplitBloomFilter bloom_filter1 = BlockSplitBloomFilter::Deserialize(&source);
@@ -192,10 +192,10 @@ TEST(CompatibilityTest, TestBloomFilter) {
   bloom_filter2.WriteTo(&sink);
   std::shared_ptr<Buffer> buffer1 = sink.GetBuffer();
 
-  handle->Seek(0);
-  handle->GetSize(&size);
+  PARQUET_THROW_NOT_OK(handle->Seek(0));
+  PARQUET_THROW_NOT_OK(handle->GetSize(&size));
   std::shared_ptr<Buffer> buffer2;
-  handle->Read(size, &buffer2);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer2));
 
   EXPECT_TRUE((*buffer1).Equals(*buffer2));
 }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Clang compiler warnings in bloom_filter-test.cc
> -
>
> Key: PARQUET-1384
> URL: https://issues.apache.org/jira/browse/PARQUET-1384
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Junjie Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> {code}
> [69/95] Building CXX object 
> src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^~ 
>   ~
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^~   ~
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
>   ^~~ ~
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}
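
The -Wpessimizing-move warnings above come from wrapping a returned temporary in std::move, which prevents copy elision; the fix is simply to drop the std::move. A minimal illustration with simplified types (not the actual test code):

{code}
// Simplified illustration of the -Wpessimizing-move pattern and its fix.
struct Filter {};

Filter Deserialize() { return Filter{}; }

void Example() {
  // Before: std::move(Deserialize()) on the temporary blocks copy elision
  // and triggers -Wpessimizing-move under clang.
  // After: assign the temporary directly and let the compiler elide the copy.
  Filter f = Deserialize();
  (void)f;
}
{code}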



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Parquet sync meeting minutes

2018-08-17 Thread Zoltan Ivanfi
Hi,

Sorry, that was an error on my side; I suggested that Nandor add a TLDR
section with this title. I agree with your comment, Wes: "outcome" would have
been a better choice of word than "decision".

Br,

Zoltan

On Fri, Aug 17, 2018 at 6:36 PM Wes McKinney  wrote:

> hi Nandor,
>
> A fine detail, and I may be wrong, but I don't think decisions can
> technically be made on a call because time zones do not permit
> everyone to join always and not all collaborators are comfortable
> having live discussions in English. see [1]
>
> You can present the consensus of the participants in the call summary
> and others in the community have an opportunity to provide feedback.
> The "decision" is therefore one based on lazy consensus thereafter if
> there are no objections or follow up discussion
>
> - Wes
>
> [1]: https://www.apache.org/foundation/how-it-works.html#management
>
> On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
>  wrote:
> > Topics discussed and decisions (meeting held on 2018 August 15th, at
> > 6pm CET / 9 am PST):
> >
> > - Aligning page row boundaries between different columns: Debated,
> > please follow-up
> > - Remove Java specific code from parquet-format: Accepted
> > - Column encryption: Please review
> > - Parquet-format release: Scope accepted
> > - C++ mono-repo: Please vote
> >
> >
> >
> > Aligning page row boundaries between different columns (Gabor)
> > --
> >
> > Background: In the existing specification of column indexes, page
> > boundaries are not aligned between different columns with respect to row
> > count.
> >
> > Gabor: implemented this logic, interested parties can review the code
> here:
> > - https://github.com/apache/parquet-mr/pull/509
> > - https://github.com/apache/parquet-mr/commits/column-indexes
> >
> > Main takeaway from implementation:
> >
> > - Index filtering logic as currently specified is overcomplicated.
> > - May become a maintenance burden and result in a steep learning curve
> > for onboarding new developers.
> > - Cannot be made transparent; vectorized readers (Hive, Spark) have
> > to implement similar logic.
> >
> > Suggestion:
> >
> > - Align page row boundaries between different columns, i.e. the n-th
> > page of every column should contain the same number of rows.
> > - Filtering logic would be a lot simpler.
> > - Vectorized readers will get index-based filtering without any change
> > required on their side.
> >
> > Response:
> > - Ryan doesn't recommend it. Performance numbers?
> > - Discuss offline or on dev mailing list
> > - Timeline for reaching decision? Within a week. (Gabor already has a
> > working implementation.)
> >
> >
> >
> > Remove Java specific code from parquet-format (Nandor)
> > --
> >
> > Background: Parquet-format contains a few Java classes. Earlier no
> > changes were required in these, but this has changed in recent
> > features, especially with the new column encryption feature, which
> > would add substantial new code.
> >
> > Suggestion (Nandor): Instead of cluttering parquet-format further with
> > java-specific code, move these classes to parquet-mr and deprecate
> > them in parquet-format.
> >
> > What is the motivation behind the status quo? Julien: We may need a
> > different Thrift version in the parquet-thrift binding than in the
> > parquet files themselves. If we move these classes to parquet-mr, we
> > should shade thrift. Additionally, currently a thrift-compiler is only
> > needed for parquet-format, not parquet-mr, this will change. Gabor:
> > Dockerization may help.
> >
> > Julien: We could merge the two repos altogether as well. Gabor: This,
> > however, would move the specification into the Java implementation,
> > which would be against the cross-language ideology, so let's keep the
> > separate repo for the format. Zoltan: Other language bindings should
> > also consider directly using it instead of copying parquet.thrift into
> > their source code.
> >
> >
> >
> > Column encryption (Gidon)
> > -
> >
> > Under development:
> > - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> > - Anonymization and data masking (PARQUET-1376)
> >
> > Java PRs under review:
> > - https://github.com/apache/parquet-mr/pull/471
> > - https://github.com/apache/parquet-mr/pull/472
> >
> > C++ PR:
> > - https://github.com/apache/parquet-cpp/pull/475
> >
> >
> > We need more testing (both unit tests and interop tests between Java and
> C++).
> >
> >
> >
> > Parquet-format release (Zoltan)
> > ---
> >
> > Suggested scope (Zoltan):
> > - Column encryption
> > - Nanosec precision
> > - Anything else?
> >
> > Discussion:
> > - Nothing else to add.
> > - Wes welcomes the nano precision, will be needed in parquet-cpp as well.
> >
> >
> >
> > C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> > --
> >
> >
> 

Re: Parquet sync meeting minutes

2018-08-17 Thread Wes McKinney
hi Nandor,

A fine detail, and I may be wrong, but I don't think decisions can
technically be made on a call because time zones do not permit
everyone to join always and not all collaborators are comfortable
having live discussions in English. see [1]

You can present the consensus of the participants in the call summary
and others in the community have an opportunity to provide feedback.
The "decision" is therefore one based on lazy consensus thereafter if
there are no objections or follow up discussion

- Wes

[1]: https://www.apache.org/foundation/how-it-works.html#management

On Fri, Aug 17, 2018 at 8:38 AM, Nandor Kollar
 wrote:
> Topics discussed and decisions (meeting held on 2018 August 15th, at
> 6pm CET / 9 am PST):
>
> - Aligning page row boundaries between different columns: Debated,
> please follow-up
> - Remove Java specific code from parquet-format: Accepted
> - Column encryption: Please review
> - Parquet-format release: Scope accepted
> - C++ mono-repo: Please vote
>
>
>
> Aligning page row boundaries between different columns (Gabor)
> --
>
> Background: In the existing specification of column indexes, page
> boundaries are not aligned between different columns with respect to row
> count.
>
> Gabor: implemented this logic, interested parties can review the code here:
> - https://github.com/apache/parquet-mr/pull/509
> - https://github.com/apache/parquet-mr/commits/column-indexes
>
> Main takeaway from implementation:
>
> - Index filtering logic as currently specified is overcomplicated.
> - May become a maintenance burden and result in a steep learning curve
> for onboarding new developers.
> - Cannot be made transparent; vectorized readers (Hive, Spark) have
> to implement similar logic.
>
> Suggestion:
>
> - Align page row boundaries between different columns, i.e. the n-th
> page of every column should contain the same number of rows.
> - Filtering logic would be a lot simpler.
> - Vectorized readers will get index-based filtering without any change
> required on their side.
>
> Response:
> - Ryan doesn't recommend it. Performance numbers?
> - Discuss offline or on dev mailing list
> - Timeline for reaching decision? Within a week. (Gabor already has a
> working implementation.)
>
>
>
> Remove Java specific code from parquet-format (Nandor)
> --
>
> Background: Parquet-format contains a few Java classes. Earlier no
> changes were required in these, but this has changed in recent
> features, especially with the new column encryption feature, which
> would add substantial new code.
>
> Suggestion (Nandor): Instead of cluttering parquet-format further with
> java-specific code, move these classes to parquet-mr and deprecate
> them in parquet-format.
>
> What is the motivation behind the status quo? Julien: We may need a
> different Thrift version in the parquet-thrift binding than in the
> parquet files themselves. If we move these classes to parquet-mr, we
> should shade thrift. Additionally, currently a thrift-compiler is only
> needed for parquet-format, not parquet-mr, this will change. Gabor:
> Dockerization may help.
>
> Julien: We could merge the two repos altogether as well. Gabor: This,
> however, would move the specification into the Java implementation,
> which would be against the cross-language ideology, so let's keep the
> separate repo for the format. Zoltan: Other language bindings should
> also consider directly using it instead of copying parquet.thrift into
> their source code.
>
>
>
> Column encryption (Gidon)
> -
>
> Under development:
> - Key management API (doesn’t provide E2E key management) (PARQUET-1373)
> - Anonymization and data masking (PARQUET-1376)
>
> Java PRs under review:
> - https://github.com/apache/parquet-mr/pull/471
> - https://github.com/apache/parquet-mr/pull/472
>
> C++ PR:
> - https://github.com/apache/parquet-cpp/pull/475
>
>
> We need more testing (both unit tests and interop tests between Java and C++).
>
>
>
> Parquet-format release (Zoltan)
> ---
>
> Suggested scope (Zoltan):
> - Column encryption
> - Nanosec precision
> - Anything else?
>
> Discussion:
> - Nothing else to add.
> - Wes welcomes the nano precision, will be needed in parquet-cpp as well.
>
>
>
> C++ mono-repo: merging Arrow and parquet-cpp (Wes)
> --
>
>
> Background: duplicated CI system and codebase, circular dependencies
> between libraries
>
> Suggestion (Wes): move parquet-cpp into arrow codebase. Details can be
> read here: 
> https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
>
>
> Resolution: No objections but no final decision either, vote on the
> parquet list: 
> https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E


[jira] [Commented] (PARQUET-1389) Improve value skipping at page synchronization

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584107#comment-16584107
 ] 

ASF GitHub Bot commented on PARQUET-1389:
-

gszadovszky opened a new pull request #514: PARQUET-1389: Improve value 
skipping at page synchronization
URL: https://github.com/apache/parquet-mr/pull/514
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve value skipping at page synchronization
> --
>
> Key: PARQUET-1389
> URL: https://issues.apache.org/jira/browse/PARQUET-1389
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, value skipping is done one-by-one for page synchronization. There 
> are encodings (e.g. plain) where several values can be skipped at once. 
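
As an illustration of why PLAIN encoding permits bulk skipping, a fixed-width decoder can skip n values with one offset adjustment instead of n decode calls. A hedged sketch follows (illustrative C++; the actual work targets parquet-mr):

{code}
// Illustrative sketch only: for PLAIN-encoded fixed-width values, skipping is
// pointer arithmetic rather than decoding each value individually.
#include <cstdint>
#include <stdexcept>

struct PlainFixedWidthDecoder {
  const uint8_t* data;
  int64_t bytes_left;
  int64_t value_width;  // e.g. 8 for INT64, 4 for FLOAT

  void Skip(int64_t n) {
    const int64_t bytes = n * value_width;
    if (bytes > bytes_left) throw std::runtime_error("not enough values to skip");
    data += bytes;
    bytes_left -= bytes;
  }
};
{code}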



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1389) Improve value skipping at page synchronization

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1389:

Labels: pull-request-available  (was: )

> Improve value skipping at page synchronization
> --
>
> Key: PARQUET-1389
> URL: https://issues.apache.org/jira/browse/PARQUET-1389
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, value skipping is done one-by-one for page synchronization. There 
> are encodings (e.g. plain) where several values can be skipped at once. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1310) Column indexes: Filtering

2018-08-17 Thread Gabor Szadovszky (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1310.
---
Resolution: Fixed

> Column indexes: Filtering
> -
>
> Key: PARQUET-1310
> URL: https://issues.apache.org/jira/browse/PARQUET-1310
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583997#comment-16583997
 ] 

ASF GitHub Bot commented on PARQUET-1384:
-

cjjnjust closed pull request #488: PARQUET-1384: fix clang build error for 
bloom_filter-test.cc
URL: https://github.com/apache/parquet-cpp/pull/488
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc
index 69583af5..96d2e065 100644
--- a/src/parquet/bloom_filter-test.cc
+++ b/src/parquet/bloom_filter-test.cc
@@ -165,7 +165,7 @@ TEST(CompatibilityTest, TestBloomFilter) {
 
   std::unique_ptr<uint8_t[]> bitset(new uint8_t[size]());
   std::shared_ptr<Buffer> buffer(new Buffer(bitset.get(), size));
-  handle->Read(size, &buffer);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer));
 
   InMemoryInputStream source(buffer);
   BlockSplitBloomFilter bloom_filter1 = BlockSplitBloomFilter::Deserialize(&source);
@@ -192,10 +192,10 @@ TEST(CompatibilityTest, TestBloomFilter) {
   bloom_filter2.WriteTo(&sink);
   std::shared_ptr<Buffer> buffer1 = sink.GetBuffer();
 
-  handle->Seek(0);
-  handle->GetSize(&size);
+  PARQUET_THROW_NOT_OK(handle->Seek(0));
+  PARQUET_THROW_NOT_OK(handle->GetSize(&size));
   std::shared_ptr<Buffer> buffer2;
-  handle->Read(size, &buffer2);
+  PARQUET_THROW_NOT_OK(handle->Read(size, &buffer2));
 
   EXPECT_TRUE((*buffer1).Equals(*buffer2));
 }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Clang compiler warnings in bloom_filter-test.cc
> -
>
> Key: PARQUET-1384
> URL: https://issues.apache.org/jira/browse/PARQUET-1384
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Junjie Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> {code}
> [69/95] Building CXX object 
> src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^~ 
>   ~
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^~   ~
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
>   ^~~ ~
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583996#comment-16583996
 ] 

ASF GitHub Bot commented on PARQUET-1384:
-

cjjnjust opened a new pull request #490: PARQUET-1384: fix clang build error 
for bloom_filter-test.cc
URL: https://github.com/apache/parquet-cpp/pull/490
 
 
   Replaces https://github.com/apache/parquet-cpp/pull/488


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Clang compiler warnings in bloom_filter-test.cc
> -
>
> Key: PARQUET-1384
> URL: https://issues.apache.org/jira/browse/PARQUET-1384
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Junjie Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> {code}
> [69/95] Building CXX object 
> src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^~ 
>   ~
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^~   ~
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
>   ^~~ ~
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583976#comment-16583976
 ] 

ASF GitHub Bot commented on PARQUET-1385:
-

wesm closed pull request #489: PARQUET-1385: Do not run TestBloomFilter.FPPTest 
when valgrind is in use
URL: https://github.com/apache/parquet-cpp/pull/489
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/bloom_filter-test.cc b/src/parquet/bloom_filter-test.cc
index dbef8c8b..dfdac12b 100644
--- a/src/parquet/bloom_filter-test.cc
+++ b/src/parquet/bloom_filter-test.cc
@@ -99,6 +99,11 @@ std::string GetRandomString(uint32_t length) {
   return ret;
 }
 
+#ifndef PARQUET_VALGRIND
+
+// PARQUET-1385(wesm): This test is very slow under valgrind; we omit it in
+// test runs for the sake of Travis CI
+
 TEST(FPPTest, TestBloomFilter) {
   // It counts the number of times FindHash returns true.
   int exist = 0;
@@ -137,6 +142,8 @@ TEST(FPPTest, TestBloomFilter) {
   EXPECT_TRUE(exist < total_count * fpp);
 }
 
+#endif  // PLASMA_VALGRIND
+
 // The CompatibilityTest is used to test cross compatibility with parquet-mr, 
it reads
 // the Bloom filter binary generated by the Bloom filter class in the 
parquet-mr project
 // and tests whether the values inserted before could be filtered or not.
diff --git a/src/parquet/types.h b/src/parquet/types.h
index aec99656..10789cbf 100644
--- a/src/parquet/types.h
+++ b/src/parquet/types.h
@@ -114,13 +114,9 @@ struct Compression {
 };
 
 struct Encryption {
-  enum type {
-AES_GCM_V1 = 0,
-AES_GCM_CTR_V1 = 1
-  };
+  enum type { AES_GCM_V1 = 0, AES_GCM_CTR_V1 = 1 };
 };
 
-
 // parquet::PageType
 struct PageType {
   enum type { DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 };


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1382) [C++] Prepare for arrow::test namespace removal

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1382.
---
   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 487
[https://github.com/apache/parquet-cpp/pull/487]

> [C++] Prepare for arrow::test namespace removal
> ---
>
> Key: PARQUET-1382
> URL: https://issues.apache.org/jira/browse/PARQUET-1382
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> ARROW-3059 will remove the {{arrow::test}} namespace, make sure the 
> parquet-cpp codebase doesn't break.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1382) [C++] Prepare for arrow::test namespace removal

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583972#comment-16583972
 ] 

ASF GitHub Bot commented on PARQUET-1382:
-

wesm closed pull request #487: PARQUET-1382: [C++] Prepare for arrow::test 
namespace removal
URL: https://github.com/apache/parquet-cpp/pull/487
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.travis.yml b/.travis.yml
index 7918b890..e1faf68f 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -14,8 +14,15 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+
 sudo: required
 dist: trusty
+
+language: cpp
+
+cache:
+  ccache: true
+
 addons:
   apt:
 sources:
@@ -35,6 +42,7 @@ addons:
 - bison
 - flex
 - pkg-config
+
 matrix:
   fast_finish: true
   include:
@@ -42,10 +50,7 @@ matrix:
 os: linux
 before_script:
 - export PARQUET_CXXFLAGS="-DARROW_NO_DEPRECATED_API"
-- source $TRAVIS_BUILD_DIR/ci/before_script_travis.sh
-  - compiler: gcc
-os: linux
-before_script:
+- export PARQUET_TRAVIS_VALGRIND=1
 - source $TRAVIS_BUILD_DIR/ci/before_script_travis.sh
   - compiler: clang
 os: linux
@@ -76,8 +81,6 @@ matrix:
 script:
 - $TRAVIS_BUILD_DIR/ci/travis_script_toolchain.sh
 
-language: cpp
-
 # PARQUET-626: revisit llvm toolchain when/if llvm.org apt repo resurfaces
 
 # before_install:
diff --git a/ci/before_script_travis.sh b/ci/before_script_travis.sh
index 95a2fd82..ce0234c0 100755
--- a/ci/before_script_travis.sh
+++ b/ci/before_script_travis.sh
@@ -28,15 +28,20 @@ fi
 
 export PARQUET_TEST_DATA=$TRAVIS_BUILD_DIR/data
 
+CMAKE_COMMON_FLAGS="-DPARQUET_BUILD_WARNING_LEVEL=CHECKIN"
+
+if [ $PARQUET_TRAVIS_VALGRIND == "1" ]; then
+  CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DPARQUET_TEST_MEMCHECK=ON"
+fi
+
 if [ $TRAVIS_OS_NAME == "linux" ]; then
-cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \
-  -DPARQUET_TEST_MEMCHECK=ON \
+cmake $CMAKE_COMMON_FLAGS \
+  -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \
   -DPARQUET_BUILD_BENCHMARKS=ON \
-  -DPARQUET_BUILD_WARNING_LEVEL=CHECKIN \
   -DPARQUET_GENERATE_COVERAGE=1 \
   $TRAVIS_BUILD_DIR
 else
-cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \
-  -DPARQUET_BUILD_WARNING_LEVEL=CHECKIN \
+cmake $CMAKE_COMMON_FLAGS \
+  -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \
   $TRAVIS_BUILD_DIR
 fi
diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat
index 0136819b..7a50c854 100644
--- a/ci/msvc-build.bat
+++ b/ci/msvc-build.bat
@@ -45,8 +45,8 @@ if defined need_vcvarsall (
 
 if "%CONFIGURATION%" == "Toolchain" (
   conda install -y boost-cpp=1.63 thrift-cpp=0.11.0 ^
-  brotli=0.6.0 zlib=1.2.11 snappy=1.1.6 lz4-c=1.7.5 zstd=1.2.0 ^
-  -c conda-forge
+  brotli=1.0.2 zlib=1.2.11 snappy=1.1.7 lz4-c=1.8.0 zstd=1.3.3 ^
+  -c conda-forge || exit /B
 
   set ARROW_BUILD_TOOLCHAIN=%MINICONDA%/Library
   set PARQUET_BUILD_TOOLCHAIN=%MINICONDA%/Library
diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh
index d3cef663..30313634 100755
--- a/ci/travis_script_cpp.sh
+++ b/ci/travis_script_cpp.sh
@@ -33,18 +33,18 @@ make lint
 # fi
 
 if [ $TRAVIS_OS_NAME == "linux" ]; then
-  make -j4 || exit 1
-  ctest -VV -L unittest || { cat 
$TRAVIS_BUILD_DIR/parquet-build/Testing/Temporary/LastTest.log; exit 1; }
+  make -j4
+  ctest -j2 -VV -L unittest
 # Current cpp-coveralls version 0.4 throws an error (PARQUET-1075) on Travis 
CI. Pin to last working version
   sudo pip install cpp_coveralls==0.3.12
   export PARQUET_ROOT=$TRAVIS_BUILD_DIR
   $TRAVIS_BUILD_DIR/ci/upload_coverage.sh
 else
-  make -j4 || exit 1
+  make -j4
   BUILD_TYPE=debug
   EXECUTABLE_DIR=$CPP_BUILD_DIR/$BUILD_TYPE
   export LD_LIBRARY_PATH=$EXECUTABLE_DIR:$LD_LIBRARY_PATH
-  ctest -VV -L unittest || { cat 
$TRAVIS_BUILD_DIR/parquet-build/Testing/Temporary/LastTest.log; exit 1; }
+  ctest -j2 -VV -L unittest
 fi
 
 popd
diff --git a/ci/travis_script_static.sh b/ci/travis_script_static.sh
index b76ced8f..8af574e3 100755
--- a/ci/travis_script_static.sh
+++ b/ci/travis_script_static.sh
@@ -65,8 +65,14 @@ export 
ZLIB_STATIC_LIB=$ARROW_EP/zlib_ep/src/zlib_ep-install/lib/libz.a
 export LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a
 export ZSTD_STATIC_LIB=$ARROW_EP/zstd_ep-prefix/src/zstd_ep/lib/libzstd.a
 
-cmake -DPARQUET_CXXFLAGS="$PARQUET_CXXFLAGS" \
-  -DPARQUET_TEST_MEMCHECK=ON \
+CMAKE_COMMON_FLAGS="-DPARQUET_BUILD_WARNING_LEVEL=CHECKIN"
+
+if [ $PARQUET_TRAVIS_VALGRIND == "1" ]; then
+  CMAKE_COMMON_FLAGS="$CMAKE_COMMON_FLAGS -DPARQUET_TEST_MEMCHECK=ON"
+fi
+
+cmake 

[jira] [Created] (PARQUET-1389) Improve value skipping at page synchronization

2018-08-17 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1389:
-

 Summary: Improve value skipping at page synchronization
 Key: PARQUET-1389
 URL: https://issues.apache.org/jira/browse/PARQUET-1389
 Project: Parquet
  Issue Type: Sub-task
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Currently, value skipping is done one-by-one for page synchronization. There 
are encodings (e.g. plain) where several values can be skipped at once. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Parquet sync meeting minutes

2018-08-17 Thread Nandor Kollar
Topics discussed and decisions (meeting held on 2018 August 15th, at
6pm CET / 9 am PST):

- Aligning page row boundaries between different columns: Debated,
please follow-up
- Remove Java specific code from parquet-format: Accepted
- Column encryption: Please review
- Parquet-format release: Scope accepted
- C++ mono-repo: Please vote



Aligning page row boundaries between different columns (Gabor)
--

Background: In the existing specification of column indexes, page
boundaries are not aligned between different columns with respect to row
count.

Gabor: implemented this logic, interested parties can review the code here:
- https://github.com/apache/parquet-mr/pull/509
- https://github.com/apache/parquet-mr/commits/column-indexes

Main takeaway from implementation:

- Index filtering logic as currently specified is overcomplicated.
- May become a maintenance burden and result in a steep learning curve
for onboarding new developers.
- Cannot be made transparent; vectorized readers (Hive, Spark) have
to implement similar logic.

Suggestion:

- Align page row boundaries between different columns, i.e. the n-th
page of every column should contain the same number of rows.
- Filtering logic would be a lot simpler.
- Vectorized readers will get index-based filtering without any change
required on their side.

Response:
- Ryan doesn't recommend it. Performance numbers?
- Discuss offline or on dev mailing list
- Timeline for reaching decision? Within a week. (Gabor already has a
working implementation.)



Remove Java specific code from parquet-format (Nandor)
--

Background: Parquet-format contains a few Java classes. Earlier no
changes were required in these, but this has changed in recent
features, especially with the new column encryption feature, which
would add substantial new code.

Suggestion (Nandor): Instead of cluttering parquet-format further with
java-specific code, move these classes to parquet-mr and deprecate
them in parquet-format.

What is the motivation behind the status quo? Julien: We may need a
different Thrift version in the parquet-thrift binding than in the
parquet files themselves. If we move these classes to parquet-mr, we
should shade thrift. Additionally, currently a thrift-compiler is only
needed for parquet-format, not parquet-mr, this will change. Gabor:
Dockerization may help.

Julien: We could merge the two repos altogether as well. Gabor: This,
however, would move the specification into the Java implementation,
which would be against the cross-language ideology, so let's keep the
separate repo for the format. Zoltan: Other language bindings should
also consider directly using it instead of copying parquet.thrift into
their source code.



Column encryption (Gidon)
-

Under development:
- Key management API (doesn’t provide E2E key management) (PARQUET-1373)
- Anonymization and data masking (PARQUET-1376)

Java PRs under review:
- https://github.com/apache/parquet-mr/pull/471
- https://github.com/apache/parquet-mr/pull/472

C++ PR:
- https://github.com/apache/parquet-cpp/pull/475


We need more testing (both unit tests and interop tests between Java and C++).



Parquet-format release (Zoltan)
---

Suggested scope (Zoltan):
- Column encryption
- Nanosec precision
- Anything else?

Discussion:
- Nothing else to add.
- Wes welcomes the nano precision, will be needed in parquet-cpp as well.



C++ mono-repo: merging Arrow and parquet-cpp (Wes)
--


Background: duplicated CI system and codebase, circular dependencies
between libraries

Suggestion (Wes): move parquet-cpp into arrow codebase. Details can be
read here: 
https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E


Resolution: No objections but no final decision either, vote on the
parquet list: 
https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E


[jira] [Commented] (PARQUET-1383) Parquet tools should print logical type instead of (or besides) original type

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583826#comment-16583826
 ] 

ASF GitHub Bot commented on PARQUET-1383:
-

nandorKollar opened a new pull request #513: PARQUET-1383: Parquet tools should 
print logical type instead of (or besides) original type
URL: https://github.com/apache/parquet-mr/pull/513
 
 
   This pull request addresses two topics:
   - write the logical type in parquet-tools meta output besides the original type
   - take the to-UTC-normalized parameter into account when printing time/timestamp 
values (using stringifiers)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet tools should print logical type instead of (or besides) original type
> -
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, parquet-tools prints the original type. Since the new logical 
> type API has been introduced, it would be better to print the logical type 
> instead of, or besides, the original type.
> Also, the values written by the tools should take the to-UTC-normalized parameter 
> into account. Right now, every time and timestamp value is adjusted to UTC 
> when printed via parquet-tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1383) Parquet tools should print logical type instead of (or besides) original type

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1383:

Labels: pull-request-available  (was: )

> Parquet tools should print logical type instead of (or besides) original type
> -
>
> Key: PARQUET-1383
> URL: https://issues.apache.org/jira/browse/PARQUET-1383
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, parquet-tools prints the original type. Since the new logical 
> type API has been introduced, it would be better to print the logical type 
> instead of, or besides, the original type.
> Also, the values written by the tools should take the UTC-normalized parameter 
> into account. Right now, every time and timestamp value is adjusted to UTC 
> when printed via parquet-tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583819#comment-16583819
 ] 

ASF GitHub Bot commented on PARQUET-1387:
-

nandorKollar opened a new pull request #102: PARQUET-1387: Nanosecond precision 
time and timestamp - parquet-format
URL: https://github.com/apache/parquet-format/pull/102
 
 
   Introduce a new nanosecond precision unit in TimeUnit


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1387:

Labels: pull-request-available  (was: )

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1388) Nanosecond precision time and timestamp - parquet-mr

2018-08-17 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1388:
--

 Summary: Nanosecond precision time and timestamp - parquet-mr
 Key: PARQUET-1388
 URL: https://issues.apache.org/jira/browse/PARQUET-1388
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: (was: format-2.6.0)

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1387:
---
Fix Version/s: format-2.6.0

> Nanosecond precision time and timestamp - parquet-format
> 
>
> Key: PARQUET-1387
> URL: https://issues.apache.org/jira/browse/PARQUET-1387
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: format-2.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1387) Nanosecond precision time and timestamp - parquet-format

2018-08-17 Thread Nandor Kollar (JIRA)
Nandor Kollar created PARQUET-1387:
--

 Summary: Nanosecond precision time and timestamp - parquet-format
 Key: PARQUET-1387
 URL: https://issues.apache.org/jira/browse/PARQUET-1387
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Nandor Kollar
Assignee: Nandor Kollar






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1386) Fix issues of NaN and +-0.0 in case of float/double column indexes

2018-08-17 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1386:
-

 Summary: Fix issues of NaN and +-0.0 in case of float/double 
column indexes
 Key: PARQUET-1386
 URL: https://issues.apache.org/jira/browse/PARQUET-1386
 Project: Parquet
  Issue Type: Sub-task
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Work around the NaN and +/-0.0 issues for float/double column indexes just like we 
did for statistics in PARQUET-1246.
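For context, the reason NaN and +/-0.0 need special handling in min/max bounds (the
same problem PARQUET-1246 addressed for statistics) is that naive comparisons give
misleading bounds. A small illustration of the pitfalls:

{code}
#include <cmath>
#include <iostream>

int main() {
  const double nan = std::nan("");
  // Every ordered comparison involving NaN is false, so a naive min/max scan
  // either ignores NaN or gets stuck on it, depending on operand order.
  std::cout << std::boolalpha
            << (nan < 1.0) << " " << (nan > 1.0) << " " << (nan == nan) << "\n";
  // -0.0 and +0.0 compare equal, so bounds like [min=-0.0, max=+0.0] may be
  // emitted as [0.0, 0.0] or [-0.0, -0.0] unless the values are normalized,
  // which breaks sign-sensitive filtering.
  std::cout << (-0.0 == 0.0) << "\n";  // prints true
  return 0;
}
{code}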



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread Junjie Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583489#comment-16583489
 ] 

Junjie Chen commented on PARQUET-1385:
--

According to perf, std::seed_seq::generate takes more than 75% of the CPU cycles. We 
can switch to using the system clock as the seed, which cuts the runtime to about 
1/5 (on my machine). Skipping this test under valgrind is also fine.
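A sketch of the suggested seeding change (assuming the test only needs well-distributed
values, not reproducible or cryptographic-quality seeding): seed a single engine from
the clock instead of expanding a std::seed_seq for every random string:

{code}
#include <chrono>
#include <random>
#include <string>

// Illustrative replacement for an expensive std::seed_seq-based setup:
// seed one engine from the system clock and reuse it for every string.
std::string RandomString(std::size_t length) {
  static std::mt19937 engine(static_cast<std::mt19937::result_type>(
      std::chrono::system_clock::now().time_since_epoch().count()));
  static const char alphabet[] =
      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
  std::uniform_int_distribution<std::size_t> pick(0, sizeof(alphabet) - 2);

  std::string out(length, '\0');
  for (char& c : out) {
    c = alphabet[pick(engine)];
  }
  return out;
}
{code}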

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to make it faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1308.
---
   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 467
[https://github.com/apache/parquet-cpp/pull/467]

> [C++] parquet::arrow should use thread pool, not ParallelFor
> 
>
> Key: PARQUET-1308
> URL: https://issues.apache.org/jira/browse/PARQUET-1308
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> Arrow now has a global thread pool, parquet::arrow should use that instead of 
> ParallelFor.
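For readers following along, the gist of the change is to stop spawning ad-hoc threads
per call (what ParallelFor did) and to submit the per-column work to a shared pool
instead. A rough illustration of the pattern, using std::async as a stand-in because
the Arrow thread-pool API of that era isn't reproduced here:

{code}
#include <future>
#include <vector>

// Illustrative only: run read_column(i) for each column on a worker and wait
// for all of them. read_column is assumed to be a void(int) callable; a real
// shared pool would also bound the number of concurrent workers.
template <typename ReadColumnFn>
void ReadColumnsInParallel(int num_columns, ReadColumnFn read_column) {
  std::vector<std::future<void>> tasks;
  tasks.reserve(num_columns);
  for (int i = 0; i < num_columns; ++i) {
    tasks.push_back(std::async(std::launch::async, read_column, i));
  }
  for (auto& task : tasks) {
    task.get();  // wait for completion and propagate any exception
  }
}
{code}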



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583431#comment-16583431
 ] 

ASF GitHub Bot commented on PARQUET-1308:
-

wesm closed pull request #467: PARQUET-1308: [C++] Use Arrow thread pool, not 
Arrow ParallelFor, fix deprecated APIs, upgrade clang-format version. Fix 
record delimiting bug
URL: https://github.com/apache/parquet-cpp/pull/467
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/benchmarks/decode_benchmark.cc b/benchmarks/decode_benchmark.cc
index 8f2dfa07..3ae32b4c 100644
--- a/benchmarks/decode_benchmark.cc
+++ b/benchmarks/decode_benchmark.cc
@@ -42,8 +42,8 @@ class DeltaBitPackEncoder {
 
   uint8_t* Encode(int* encoded_len) {
 uint8_t* result = new uint8_t[10 * 1024 * 1024];
-int num_mini_blocks = static_cast(arrow::BitUtil::Ceil(num_values() - 
1,
-
mini_block_size_));
+int num_mini_blocks = 
static_cast(arrow::BitUtil::CeilDiv(num_values() - 1,
+   
mini_block_size_));
 uint8_t* mini_block_widths = NULL;
 
 arrow::BitWriter writer(result, 10 * 1024 * 1024);
diff --git a/cmake_modules/ArrowExternalProject.cmake 
b/cmake_modules/ArrowExternalProject.cmake
index 4f23661e..3d1a2760 100644
--- a/cmake_modules/ArrowExternalProject.cmake
+++ b/cmake_modules/ArrowExternalProject.cmake
@@ -46,7 +46,7 @@ if (MSVC AND PARQUET_USE_STATIC_CRT)
 endif()
 
 if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
-  set(ARROW_VERSION "501d60e918bd4d10c429ab34e0b8e8a87dffb732")
+  set(ARROW_VERSION "3edfd7caf2746eeba37d5ac7bfd3665cc159e7ad")
 else()
   set(ARROW_VERSION "$ENV{PARQUET_ARROW_VERSION}")
 endif()
diff --git a/cmake_modules/FindClangTools.cmake 
b/cmake_modules/FindClangTools.cmake
index 215a5cd9..56e2dd77 100644
--- a/cmake_modules/FindClangTools.cmake
+++ b/cmake_modules/FindClangTools.cmake
@@ -96,7 +96,9 @@ if (CLANG_FORMAT_VERSION)
 endif()
 else()
 find_program(CLANG_FORMAT_BIN
-  NAMES clang-format-4.0
+  NAMES clang-format-6.0
+  clang-format-5.0
+  clang-format-4.0
   clang-format-3.9
   clang-format-3.8
   clang-format-3.7
diff --git a/cmake_modules/SetupCxxFlags.cmake 
b/cmake_modules/SetupCxxFlags.cmake
index 01ed85bf..5ca3f4ef 100644
--- a/cmake_modules/SetupCxxFlags.cmake
+++ b/cmake_modules/SetupCxxFlags.cmake
@@ -84,6 +84,7 @@ if ("${UPPERCASE_BUILD_WARNING_LEVEL}" STREQUAL "CHECKIN")
 -Wno-shadow -Wno-switch-enum -Wno-exit-time-destructors \
 -Wno-global-constructors -Wno-weak-template-vtables 
-Wno-undefined-reinterpret-cast \
 -Wno-implicit-fallthrough -Wno-unreachable-code-return \
+-Wno-documentation-deprecated-sync \
 -Wno-float-equal -Wno-missing-prototypes \
 -Wno-old-style-cast -Wno-covered-switch-default \
 -Wno-format-nonliteral -Wno-missing-noreturn \
diff --git a/src/parquet/arrow/arrow-reader-writer-benchmark.cc 
b/src/parquet/arrow/arrow-reader-writer-benchmark.cc
index 15d2cf72..51eb0c23 100644
--- a/src/parquet/arrow/arrow-reader-writer-benchmark.cc
+++ b/src/parquet/arrow/arrow-reader-writer-benchmark.cc
@@ -104,9 +104,9 @@ std::shared_ptr<::arrow::Table> TableFromVector(
 std::vector valid_bytes(BENCHMARK_SIZE, 0);
 int n = {0};
 std::generate(valid_bytes.begin(), valid_bytes.end(), [] { return n++ % 
2; });
-EXIT_NOT_OK(builder.Append(vec.data(), vec.size(), valid_bytes.data()));
+EXIT_NOT_OK(builder.AppendValues(vec.data(), vec.size(), 
valid_bytes.data()));
   } else {
-EXIT_NOT_OK(builder.Append(vec.data(), vec.size(), nullptr));
+EXIT_NOT_OK(builder.AppendValues(vec.data(), vec.size(), nullptr));
   }
   std::shared_ptr<::arrow::Array> array;
   EXIT_NOT_OK(builder.Finish());
@@ -126,9 +126,9 @@ std::shared_ptr<::arrow::Table> 
TableFromVector(const std::vector array;
   EXIT_NOT_OK(builder.Finish());
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc 
b/src/parquet/arrow/arrow-reader-writer-test.cc
index d4f5b000..be3e6114 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -320,8 +320,7 @@ using ParquetDataType = 
DataType::parquet_enum>;
 template 
 using ParquetWriter = TypedColumnWriter>;
 
-void WriteTableToBuffer(const std::shared_ptr& table, int num_threads,
-int64_t row_group_size,
+void WriteTableToBuffer(const std::shared_ptr& table, int64_t 
row_group_size,
 const std::shared_ptr& 
arrow_properties,
 std::shared_ptr* out) {
   auto sink = std::make_shared();
@@ -399,21 +398,21 @@ void AssertTablesEqual(const Table& expected, const 
Table& 

[jira] [Updated] (PARQUET-1308) [C++] parquet::arrow should use thread pool, not ParallelFor

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1308:

Labels: pull-request-available  (was: )

> [C++] parquet::arrow should use thread pool, not ParallelFor
> 
>
> Key: PARQUET-1308
> URL: https://issues.apache.org/jira/browse/PARQUET-1308
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> Arrow now has a global thread pool, parquet::arrow should use that instead of 
> ParallelFor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread Junjie Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583425#comment-16583425
 ] 

Junjie Chen commented on PARQUET-1385:
--

The GetRandomString function is very slow. I can reduce the test count to 1/10 of 
its current size, or let me try to optimize it for a while.



> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to make it faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1384:

Labels: pull-request-available  (was: )

> [C++] Clang compiler warnings in bloom_filter-test.cc
> -
>
> Key: PARQUET-1384
> URL: https://issues.apache.org/jira/browse/PARQUET-1384
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Junjie Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> {code}
> [69/95] Building CXX object 
> src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^~ 
>   ~
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^~   ~
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
>   ^~~ ~
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}
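For readers unfamiliar with -Wpessimizing-move: wrapping a returned temporary in
std::move defeats copy elision and forces an extra move construction, so the fix is
simply to drop the std::move. A tiny standalone illustration (not the actual
BlockSplitBloomFilter code):

{code}
#include <cstdint>
#include <utility>
#include <vector>

struct Filter {
  std::vector<uint8_t> bits;
};

Filter Deserialize() { return Filter{std::vector<uint8_t>(1024, 0)}; }

int main() {
  // Triggers -Wpessimizing-move: std::move on the temporary blocks copy
  // elision and forces a move the compiler could otherwise elide entirely.
  Filter pessimized = std::move(Deserialize());

  // Fixed form: initialize directly from the returned temporary.
  Filter elided = Deserialize();

  (void)pessimized;
  (void)elided;
  return 0;
}
{code}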



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1385:

Labels: pull-request-available  (was: )

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to make it faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583419#comment-16583419
 ] 

ASF GitHub Bot commented on PARQUET-1385:
-

wesm opened a new pull request #489: PARQUET-1385: Do not run 
TestBloomFilter.FPPTest when valgrind is in use
URL: https://github.com/apache/parquet-cpp/pull/489
 
 
   This test will still be run in other entries of Travis CI where valgrind is 
not being used


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
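A sketch of one way to gate the slow case, assuming the build or CI exports some
marker when running under valgrind; the PARQUET_TEST_VALGRIND variable below is a
made-up name, not necessarily the mechanism the actual PR uses:

{code}
#include <cstdlib>
#include <iostream>

#include <gtest/gtest.h>

TEST(FPPTest, TestBloomFilter) {
  // Hypothetical guard: bail out early when the suite runs under valgrind,
  // where this probabilistic test takes several minutes.
  if (std::getenv("PARQUET_TEST_VALGRIND") != nullptr) {
    std::cout << "Skipping FPPTest under valgrind" << std::endl;
    return;
  }
  // ... the expensive false-positive-probability checks would run here ...
}
{code}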


> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to make it faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1384) [C++] Clang compiler warnings in bloom_filter-test.cc

2018-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583417#comment-16583417
 ] 

ASF GitHub Bot commented on PARQUET-1384:
-

cjjnjust opened a new pull request #488: PARQUET-1384: fix clang build error 
for bloom_filter-test.cc
URL: https://github.com/apache/parquet-cpp/pull/488
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Clang compiler warnings in bloom_filter-test.cc
> -
>
> Key: PARQUET-1384
> URL: https://issues.apache.org/jira/browse/PARQUET-1384
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Junjie Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> {code}
> [69/95] Building CXX object 
> src/parquet/CMakeFiles/bloom_filter-test.dir/bloom_filter-test.cc.o
> ../src/parquet/bloom_filter-test.cc:75:36: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^
> ../src/parquet/bloom_filter-test.cc:75:36: note: remove std::move call here
>   BlockSplitBloomFilter de_bloom = 
> std::move(BlockSplitBloomFilter::Deserialize());
>^~ 
>   ~
> ../src/parquet/bloom_filter-test.cc:168:7: warning: moving a temporary object 
> prevents copy elision [-Wpessimizing-move]
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^
> ../src/parquet/bloom_filter-test.cc:168:7: note: remove std::move call here
>   std::move(BlockSplitBloomFilter::Deserialize());
>   ^~   ~
> ../src/parquet/bloom_filter-test.cc:164:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:192:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Seek(0);
>   ^~~~ ~
> ../src/parquet/bloom_filter-test.cc:193:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->GetSize();
>   ^~~ ~
> ../src/parquet/bloom_filter-test.cc:195:3: warning: ignoring return value of 
> function declared with 'warn_unused_result' attribute [-Wunused-result]
>   handle->Read(size, );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: Wes McKinney

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to make it faster, or we should not 
> run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2018-08-17 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-1385:
-

 Summary: [C++] bloom_filter-test is very slow under valgrind
 Key: PARQUET-1385
 URL: https://issues.apache.org/jira/browse/PARQUET-1385
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.5.0


This test takes ~5 minutes to run under valgrind in Travis CI

{code}
1: [==] Running 6 tests from 6 test cases.
1: [--] Global test environment set-up.
1: [--] 1 test from Murmur3Test
1: [ RUN  ] Murmur3Test.TestBloomFilter
1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
1: [--] 1 test from Murmur3Test (34 ms total)
1: 
1: [--] 1 test from ConstructorTest
1: [ RUN  ] ConstructorTest.TestBloomFilter
1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
1: [--] 1 test from ConstructorTest (101 ms total)
1: 
1: [--] 1 test from BasicTest
1: [ RUN  ] BasicTest.TestBloomFilter
1: [   OK ] BasicTest.TestBloomFilter (49 ms)
1: [--] 1 test from BasicTest (49 ms total)
1: 
1: [--] 1 test from FPPTest
1: [ RUN  ] FPPTest.TestBloomFilter
1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
1: [--] 1 test from FPPTest (308741 ms total)
1: 
1: [--] 1 test from CompatibilityTest
1: [ RUN  ] CompatibilityTest.TestBloomFilter
1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
1: [--] 1 test from CompatibilityTest (62 ms total)
1: 
1: [--] 1 test from OptimalValueTest
1: [ RUN  ] OptimalValueTest.TestBloomFilter
1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
1: [--] 1 test from OptimalValueTest (27 ms total)
1: 
1: [--] Global test environment tear-down
1: [==] 6 tests from 6 test cases ran. (309081 ms total)
1: [  PASSED  ] 6 tests.
{code}

Either we should change the FPPTest parameters to make it faster, or we should not 
run that test when using valgrind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1380) [C++] move Bloom filter test binary to parquet-testing repo

2018-08-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1380:
--
Summary: [C++] move Bloom filter test binary to parquet-testing repo  (was: 
move Bloom filter test binary to parquet-testing repo)

> [C++] move Bloom filter test binary to parquet-testing repo
> ---
>
> Key: PARQUET-1380
> URL: https://issues.apache.org/jira/browse/PARQUET-1380
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-cpp
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
> Fix For: cpp-1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)