[jira] [Commented] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697670#comment-17697670
 ] 

ASF GitHub Bot commented on PARQUET-2202:
-

wgtmac commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1128836441


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
  * @return this builder for method chaining
  */
 public Builder withBloomFilterNDV(String columnPath, long ndv) {
-  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
   this.bloomFilterNDVs.withValue(columnPath, ndv);
   // Setting an NDV for a column implies writing a bloom filter
   this.bloomFilterEnabled.withValue(columnPath, true);
   return this;
 }
 
 public Builder withBloomFilterFPP(String columnPath, double fpp) {
-  Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+  Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   Is `%f` better for `fpp`?
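
For context on the specifier question, here is how plain `java.util.Formatter`/`String.format` treats these values (illustrative only; whether parquet-mr's own `Preconditions` formats the same way is a separate question): `%d` accepts only integral types, `%f` only floating point, and `%s` accepts anything.

{code:java}
public class FormatSpecifierDemo {
  public static void main(String[] args) {
    long ndv = 100L;
    double fpp = 0.05;

    System.out.println(String.format("NDV: %d", ndv)); // "NDV: 100"
    System.out.println(String.format("FPP: %f", fpp)); // "FPP: 0.050000"
    System.out.println(String.format("FPP: %s", fpp)); // "FPP: 0.05"

    // Mixing them up fails at runtime, which is why "%d" for a double fpp
    // was a latent bug in the old message string:
    try {
      String.format("FPP: %d", fpp);
    } catch (java.util.IllegalFormatConversionException e) {
      System.out.println("%d with a double throws: " + e);
    }
  }
}
{code}

So `%f` (or `%s`) is needed for `fpp`, while for `ndv` both `%d` and `%s` format correctly.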



##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
  * @return this builder for method chaining
  */
 public Builder withBloomFilterNDV(String columnPath, long ndv) {
-  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);

Review Comment:
   Why use `%s` for long?





> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}
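
The same guard can also stay a one-liner behind a helper that defers formatting. A minimal sketch of such a typed overload (hypothetical {{LazyPreconditions}}, not necessarily what the parquet-mr {{Preconditions}} API looks like after PR #1035):

{code:java}
public final class LazyPreconditions {
  private LazyPreconditions() {}

  // Typed overload: no varargs array and no boxing on the success path,
  // and the message String is only built when the check actually fails.
  public static void checkArgument(boolean valid, String format, long arg0, long arg1) {
    if (!valid) {
      throw new IllegalArgumentException(String.format(format, arg0, arg1));
    }
  }
}

// The setByte call site could then stay compact:
//   LazyPreconditions.checkArgument(index < bytesUsed,
//       "Index: %d is >= the current size of: %d", index, bytesUsed);
{code}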



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1035: PARQUET-2202: Review usage and implementation of Preconditions.checkargument method

2023-03-07 Thread via GitHub


wgtmac commented on code in PR #1035:
URL: https://github.com/apache/parquet-mr/pull/1035#discussion_r1128836441


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
  * @return this builder for method chaining
  */
 public Builder withBloomFilterNDV(String columnPath, long ndv) {
-  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);
   this.bloomFilterNDVs.withValue(columnPath, ndv);
   // Setting an NDV for a column implies writing a bloom filter
   this.bloomFilterEnabled.withValue(columnPath, true);
   return this;
 }
 
 public Builder withBloomFilterFPP(String columnPath, double fpp) {
-  Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %d", columnPath, fpp);
+  Preconditions.checkArgument(fpp > 0.0 && fpp < 1.0, "Invalid FPP for column \"%s\": %s", columnPath, fpp);

Review Comment:
   Is `%f` better for `fpp`?



##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -477,15 +477,15 @@ public Builder withMaxBloomFilterBytes(int maxBloomFilterBytes) {
  * @return this builder for method chaining
  */
 public Builder withBloomFilterNDV(String columnPath, long ndv) {
-  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %d", columnPath, ndv);
+  Preconditions.checkArgument(ndv > 0, "Invalid NDV for column \"%s\": %s", columnPath, ndv);

Review Comment:
   Why use `%s` for long?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-03-07 Thread via GitHub


wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128833713


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   That sounds reasonable. Would you like to create a PR with your proposed 
change? @vectorijk 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697669#comment-17697669
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128833713


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   That sounds reasonable. Would you like to create a PR with your proposed 
change? @vectorijk 





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697647#comment-17697647
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null? A try-with-resources statement to close the reader might not be suitable in this function.
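
A minimal sketch of the cleanup being asked about (hypothetical helper; the actual PR may handle it differently, e.g. inline in the catch block):

{code:java}
import java.io.Closeable;
import java.io.IOException;

// Close a partially opened resource before translating the exception,
// so the underlying stream/connection does not leak when one input file fails.
final class CloseOnFailure {
  private CloseOnFailure() {}

  static void closeQuietly(Closeable closeable) {
    if (closeable == null) {
      return;
    }
    try {
      closeable.close();
    } catch (IOException ignored) {
      // best-effort cleanup; the original exception is the one worth reporting
    }
  }
}

// Inside openInputFiles, the catch block above could then become:
//   } catch (IOException e) {
//     CloseOnFailure.closeQuietly(reader);  // reader declared before the try
//     throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
//   }
{code}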



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null? A try-with-resources statement to close the reader might not be suitable in this function.





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] vectorijk commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-03-07 Thread via GitHub


vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null? A try-with-resources statement to close the reader might not be suitable in this function.



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null? A try-with-resources statement to close the reader might not be suitable in this function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] vectorijk commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-03-07 Thread via GitHub


vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697646#comment-17697646
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Shall we handle closing the connection opened by `TransParquetFileReader` when the exception is thrown here?
   In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null?





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697515#comment-17697515
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-

yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458436214

   > Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   > 
   > Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we are checking the min/max values first which is 
correct since this is the most fast thing to do. Then the dictionary and then 
the bloom filter. The ordering of the latter two is not obvious to me in every 
scenarios. At the time of filtering we did not start reading the actual row 
group so there is no advantage in I/O to read the dictionary first. 
Furthermore, searching something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably less than the size of 
the dictionary. Though, it would require some measurements to prove if it is a 
good idea to get the bloom filter before the dictionary. What do you think?
   
   It is a good idea to adjust the filter order and prefer the lighter filters first.
   But I have one concern (not sure if it is correct): in Parquet a dictionary is generated only for low-cardinality data (see `parquet.dictionary.page.size`, 1 MB), while a BloomFilter is usually used for high-cardinality columns. So ideally only one of the two will be present for a given column.
   
   And ideally we should only consult one of the two (not both): if there is a BloomFilter and the predicate is `=` or `in`, use only the BloomFilter, otherwise use the dictionary.
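
A rough sketch of the selection logic described above (hypothetical helper, not parquet-mr code; the `=`/`in` case is represented by a boolean flag):

{code:java}
// Sketch: consult at most one of the two structures per column chunk.
final class RowGroupCheckChooser {
  enum Check { BLOOM_FILTER, DICTIONARY, MIN_MAX_ONLY }

  private RowGroupCheckChooser() {}

  static Check choose(boolean hasBloomFilter, boolean fullyDictionaryEncoded,
                      boolean predicateIsEqOrIn) {
    if (hasBloomFilter && predicateIsEqOrIn) {
      return Check.BLOOM_FILTER;   // cheap membership test, skip the dictionary read
    }
    if (fullyDictionaryEncoded) {
      return Check.DICTIONARY;     // exact answer for low-cardinality chunks
    }
    return Check.MIN_MAX_ONLY;     // nothing beyond the statistics already checked
  }
}
{code}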
   




> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> If we can judge accurately by the minMax statistics, we don't need to load the 
> dictionary from the filesystem and compare value by value anymore.
> Similarly, the Bloomfilter needs to be loaded from the filesystem, which may cost 
> time and memory. If we can determine the existence/nonexistence of the value 
> exactly from the minMax or dictionary filters, then we can avoid using the 
> Bloomfilter to improve performance.
> For example,
>  # when reading data greater than {{x1}} in a block, if the minMax statistics are 
> all greater than {{{}x1{}}}, then we don't need to read the dictionary and compare 
> value by value.
>  # If we already have page dictionaries and have compared value by value, we 
> don't need to read the BloomFilter and compare.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] yabola commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread via GitHub


yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458436214

   > Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   > 
   > Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we are checking the min/max values first which is 
correct since this is the most fast thing to do. Then the dictionary and then 
the bloom filter. The ordering of the latter two is not obvious to me in every 
scenarios. At the time of filtering we did not start reading the actual row 
group so there is no advantage in I/O to read the dictionary first. 
Furthermore, searching something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably less than the size of 
the dictionary. Though, it would require some measurements to prove if it is a 
good idea to get the bloom filter before the dictionary. What do you think?
   
   It is a good idea to adjust the filter order and prefer the lighter filters first.
   But I have one concern (not sure if it is correct): in Parquet a dictionary is generated only for low-cardinality data (see `parquet.dictionary.page.size`, 1 MB), while a BloomFilter is usually used for high-cardinality columns. So ideally only one of the two will be present for a given column.
   
   And ideally we should only consult one of the two (not both): if there is a BloomFilter and the predicate is `=` or `in`, use only the BloomFilter, otherwise use the dictionary.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697510#comment-17697510
 ] 

Gabor Szadovszky commented on PARQUET-2254:
---

1) I think, for creating bloom filters we have the statistics to decide how much 
space the bloom filter shall occupy (we have the actual data). What we don't know 
is whether the bloom filter in itself will be useful or not (would there be 
filtering on the related column, and would it use Eq/NotEq/IsIn-like predicates?). 
That shall be decided by the client via the already introduced properties. We do 
not write bloom filters by default anyway.
2) Of course it is hard to be smart for PPD since we don't know the actual data 
(we are just before reading it). But there is an actual order of checking the 
row group filters: statistics, dictionary, bloom filter. Checking the statistics 
first is obviously correct. What I am not sure about is whether we want to check 
the dictionary first and then the bloom filter, or the other way around. Because 
of that question I am also unsure whether it is good practice to not write bloom 
filters when the whole column chunk is dictionary encoded.

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios, it is actually not known in advance how many distinct values there are.
> If the BloomFilter can be generated automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> I have an idea: the user can specify a maximum BloomFilter size, then we build 
> multiple BloomFilters at the same time and use the largest BloomFilter as a 
> counting tool (if there is no hit when inserting a value, the counter is 
> incremented by 1; of course this may be imprecise, but it is good enough).
> Then, at the end of the write, choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions. Thank you.
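
A rough sketch of the proposal (hypothetical {{CandidateFilter}} abstraction; a real implementation would use parquet-mr's bloom filter classes and sizing rules):

{code:java}
import java.util.List;

// Insert every value into several candidate filters of increasing size, use the
// largest candidate to approximately count distinct values, and at flush time
// keep the smallest candidate that is big enough for that count.
class AdaptiveBloomFilterSketch {
  interface CandidateFilter {
    int sizeInBytes();
    boolean mightContain(long hash);
    void insert(long hash);
  }

  private final List<CandidateFilter> candidates; // ordered smallest -> largest
  private long approxDistinct = 0;

  AdaptiveBloomFilterSketch(List<CandidateFilter> candidates) {
    this.candidates = candidates;
  }

  void insert(long hash) {
    CandidateFilter largest = candidates.get(candidates.size() - 1);
    if (!largest.mightContain(hash)) {
      approxDistinct++; // the "counting tool" from the proposal, deliberately imprecise
    }
    for (CandidateFilter f : candidates) {
      f.insert(hash);
    }
  }

  CandidateFilter pickForWrite(double fpp) {
    // Keep the smallest candidate sized for approxDistinct at the requested
    // false-positive probability; the sizing rule here is illustrative.
    long neededBits = (long) Math.ceil(-approxDistinct * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    for (CandidateFilter f : candidates) {
      if ((long) f.sizeInBytes() * 8 >= neededBits) {
        return f;
      }
    }
    return candidates.get(candidates.size() - 1);
  }
}
{code}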



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697463#comment-17697463
 ] 

Gang Wu commented on PARQUET-2254:
--

Here are two questions: 1) creating bloom filters without explicit parameters, and 
2) deciding which levels of filters to use for PPD. Both of them require additional 
input statistics, like the data distribution of that column and how effective the 
filter has been in the past. Therefore I think parquet itself does not have to be 
that smart, because it does not have those statistics. Users would leverage those 
statistics from somewhere, configure the parquet writer to create bloom filters, 
and decide which level of filters to use for PPD. WDYT? [~gszadovszky] [~miracle] 

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios, it is actually not known in advance how many distinct values there are.
> If the BloomFilter can be generated automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> I have an idea: the user can specify a maximum BloomFilter size, then we build 
> multiple BloomFilters at the same time and use the largest BloomFilter as a 
> counting tool (if there is no hit when inserting a value, the counter is 
> incremented by 1; of course this may be imprecise, but it is good enough).
> Then, at the end of the write, choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions. Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697455#comment-17697455
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-

wgtmac commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458274839

   > Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   > 
   > Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we are checking the min/max values first which is 
correct since this is the most fast thing to do. Then the dictionary and then 
the bloom filter. The ordering of the latter two is not obvious to me in every 
scenarios. At the time of filtering we did not start reading the actual row 
group so there is no advantage in I/O to read the dictionary first. 
Furthermore, searching something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably less than the size of 
the dictionary. Though, it would require some measurements to prove if it is a 
good idea to get the bloom filter before the dictionary. What do you think?
   
   What I did in production is to issue async I/Os for the dictionaries (if all data 
pages in that column chunk are dictionary-encoded and the dictionary is not big) 
and for the bloom filters of the selected row groups in advance. The reason is to 
eliminate blocking I/O when pushing down the predicates. However, the parquet spec 
only records the offset of the bloom filter. So I also added the length of each 
bloom filter to the key/value metadata section (probably a good candidate to add 
to the spec as well?)
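
A minimal sketch of carrying such a length in the footer key/value metadata (the key name is made up for illustration; the footer's {{FileMetaData#getKeyValueMetaData()}} is where a reader would look it up):

{code:java}
import java.util.HashMap;
import java.util.Map;

class BloomFilterLengthMetadataSketch {
  // Writer side: record the serialized bloom filter length for a column so a
  // reader can issue one exact-range async read instead of an open-ended one.
  static Map<String, String> withBloomFilterLength(Map<String, String> extraMetadata,
                                                   String columnPath, long serializedLength) {
    Map<String, String> copy = new HashMap<>(extraMetadata);
    copy.put("parquet.bloom.filter.length." + columnPath, Long.toString(serializedLength));
    return copy;
  }

  // Reader side: look the length up again from the footer key/value metadata.
  static long bloomFilterLength(Map<String, String> keyValueMetaData, String columnPath) {
    return Long.parseLong(keyValueMetaData.get("parquet.bloom.filter.length." + columnPath));
  }
}
{code}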




> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> If we can judge accurately by the minMax statistics, we don't need to load the 
> dictionary from the filesystem and compare value by value anymore.
> Similarly, the Bloomfilter needs to be loaded from the filesystem, which may cost 
> time and memory. If we can determine the existence/nonexistence of the value 
> exactly from the minMax or dictionary filters, then we can avoid using the 
> Bloomfilter to improve performance.
> For example,
>  # when reading data greater than {{x1}} in a block, if the minMax statistics are 
> all greater than {{{}x1{}}}, then we don't need to read the dictionary and compare 
> value by value.
>  # If we already have page dictionaries and have compared value by value, we 
> don't need to read the BloomFilter and compare.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread via GitHub


wgtmac commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1458274839

   > Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   > 
   > Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we are checking the min/max values first which is 
correct since this is the most fast thing to do. Then the dictionary and then 
the bloom filter. The ordering of the latter two is not obvious to me in every 
scenarios. At the time of filtering we did not start reading the actual row 
group so there is no advantage in I/O to read the dictionary first. 
Furthermore, searching something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably less than the size of 
the dictionary. Though, it would require some measurements to prove if it is a 
good idea to get the bloom filter before the dictionary. What do you think?
   
   What I did in production is to issue async I/Os for the dictionaries (if all data 
pages in that column chunk are dictionary-encoded and the dictionary is not big) 
and for the bloom filters of the selected row groups in advance. The reason is to 
eliminate blocking I/O when pushing down the predicates. However, the parquet spec 
only records the offset of the bloom filter. So I also added the length of each 
bloom filter to the key/value metadata section (probably a good candidate to add 
to the spec as well?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697301#comment-17697301
 ] 

Gabor Szadovszky commented on PARQUET-2254:
---

I think this is a good idea. Meanwhile, it would increase the memory footprint 
of the writer. However, if you plan to keep the current logic where the user 
decides which columns bloom filters are generated for, it should be acceptable.
However, I think we need to take a step back and investigate/synchronize the 
efforts around row group filtering. Or maybe it is only me for whom the 
following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than 
reading the bloom filter? Bloom filters should usually be smaller than 
dictionaries and faster to scan for a value.
* Based on the previous point, if we decide that it might be worth reading the 
bloom filter before the dictionary, it also questions the logic of not writing 
bloom filters when the whole column chunk is dictionary encoded.
* Meanwhile, if the whole column chunk is dictionary encoded but the dictionary 
is still small (the redundancy is high), then it might not be worth writing a 
bloom filter, since checking the dictionary might be cheaper.
What do you think?
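
For a back-of-the-envelope feel for the first bullet, using the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 bits and an assumed 20-byte average value size (illustrative numbers only):

{code:java}
public class FilterSizeEstimate {
  public static void main(String[] args) {
    long ndv = 100_000;      // distinct values in the column chunk
    double fpp = 0.01;       // target false-positive probability
    int avgValueBytes = 20;  // assumed average encoded value size

    // Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits
    double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
    System.out.printf("bloom filter ~%.0f KiB%n", bits / 8 / 1024);                     // ~117 KiB

    // Plain dictionary holding every distinct value
    System.out.printf("dictionary   ~%.0f KiB%n", ndv * (double) avgValueBytes / 1024); // ~1953 KiB
  }
}
{code}

With these (made-up) numbers the bloom filter would indeed be an order of magnitude smaller than the dictionary, but the comparison flips for very small dictionaries, which matches the last bullet above.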

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios, it is actually not known in advance how many distinct values there are.
> If the BloomFilter can be generated automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> I have an idea: the user can specify a maximum BloomFilter size, then we build 
> multiple BloomFilters at the same time and use the largest BloomFilter as a 
> counting tool (if there is no hit when inserting a value, the counter is 
> incremented by 1; of course this may be imprecise, but it is good enough).
> Then, at the end of the write, choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions. Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2254:
-

Assignee: Mars

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios, it is actually not known in advance how many distinct values there are.
> If the BloomFilter can be generated automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> I have an idea: the user can specify a maximum BloomFilter size, then we build 
> multiple BloomFilters at the same time and use the largest BloomFilter as a 
> counting tool (if there is no hit when inserting a value, the counter is 
> incremented by 1; of course this may be imprecise, but it is good enough).
> Then, at the end of the write, choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions. Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697285#comment-17697285
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-

gszadovszky commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457722446

   Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   
   Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we check the min/max values first, which is correct 
since this is the fastest thing to do. Then the dictionary, and then the bloom 
filter. The ordering of the latter two is not obvious to me in every scenario. 
At the time of filtering we have not started reading the actual row group yet, 
so there is no I/O advantage in reading the dictionary first. Furthermore, 
searching for something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably smaller than the size 
of the dictionary. Though it would require some measurements to prove whether 
it is a good idea to read the bloom filter before the dictionary. What do you think?




> Improve performance when filters in RowGroupFilter can match exactly
> 
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> If we can judge accurately by the minMax statistics, we don't need to load the 
> dictionary from the filesystem and compare value by value anymore.
> Similarly, the Bloomfilter needs to be loaded from the filesystem, which may cost 
> time and memory. If we can determine the existence/nonexistence of the value 
> exactly from the minMax or dictionary filters, then we can avoid using the 
> Bloomfilter to improve performance.
> For example,
>  # when reading data greater than {{x1}} in a block, if the minMax statistics are 
> all greater than {{{}x1{}}}, then we don't need to read the dictionary and compare 
> value by value.
>  # If we already have page dictionaries and have compared value by value, we 
> don't need to read the BloomFilter and compare.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] gszadovszky commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

2023-03-07 Thread via GitHub


gszadovszky commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457722446

   Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the jira.
   
   Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we check the min/max values first, which is correct 
since this is the fastest thing to do. Then the dictionary, and then the bloom 
filter. The ordering of the latter two is not obvious to me in every scenario. 
At the time of filtering we have not started reading the actual row group yet, 
so there is no I/O advantage in reading the dictionary first. Furthermore, 
searching for something in the bloom filter is much faster than in the 
dictionary. And the size of the bloom filter is probably smaller than the size 
of the dictionary. Though it would require some measurements to prove whether 
it is a good idea to read the bloom filter before the dictionary. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org