[jira] [Commented] (PARQUET-2242) record count for row group size check configurable

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688257#comment-17688257
 ] 

ASF GitHub Bot commented on PARQUET-2242:
-

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1105253278


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   Because we use version 1.10, I found that this issue has already been fixed by [PARQUET-1920](https://issues.apache.org/jira/browse/PARQUET-1920). Thanks for your review; this PR can be closed.





> record count for  row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: xjlem
>Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>    if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>      long memSize = columnStore.getBufferedSize();
>      long recordSize = memSize / recordCount;
>      // flush the row group if it is within ~2 records of the limit
>      // it is much better to be slightly under size than to be over at all
>      if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>        LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>        flushRowGroupToStore();
>        initStore();
>        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>        this.lastRowGroupEndPos = parquetFileWriter.getPos();
>      } else {
>        recordCountForNextMemCheck = min(
>            max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>            recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>            );
>        LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>      }
>    }
>  } {code}
> In this code, if the block size is small (for example 8 MB) and the records in the first 100 lines are small while the records after them are large, the result is an oversized row group; in our real scenario it exceeds 64 MB. So I think the record count for the block size check should be configurable.
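A minimal sketch of how the proposed knobs would be used (assumptions: the property names follow the PR's visible diff; "parquet.block.size.row.check.max" is inferred by analogy with the existing page-size properties and may differ in the final patch):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RowGroupCheckConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Existing page-size check bounds from ParquetOutputFormat:
    conf.setLong("parquet.page.size.row.check.min", 100L);
    conf.setLong("parquet.page.size.row.check.max", 10000L);
    // Proposed row-group (block) size check bounds from PR #1024;
    // lower values make the writer re-estimate the buffered size more often,
    // trading CPU for row groups that track the target block size more tightly.
    conf.setLong("parquet.block.size.row.check.min", 10L);
    conf.setLong("parquet.block.size.row.check.max", 1000L);
  }
}
{code}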



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] xjlem commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable

2023-02-13 Thread via GitHub


xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1105253278


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   Because we use version 1.10, I found that this issue has already been fixed by [PARQUET-1920](https://issues.apache.org/jira/browse/PARQUET-1920). Thanks for your review; this PR can be closed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] xjlem closed pull request #1024: PARQUET-2242:record count for row group size check configurable

2023-02-13 Thread via GitHub


xjlem closed pull request #1024: PARQUET-2242:record count for row group size 
check configurable
URL: https://github.com/apache/parquet-mr/pull/1024


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688231#comment-17688231
 ] 

ASF GitHub Bot commented on PARQUET-2241:
-

wgtmac commented on code in PR #1025:
URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1105198239


##
parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
##
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment:
   Added a check to use `valuesCount` as upper bound.
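A hypothetical illustration of the guard described above (not the actual PR #1025 patch; the helper name and element size are invented for the example): derive the element count from the page payload size, but never let it exceed `valuesCount`.

{code:java}
public class BoundedElementCount {
  // valuesCount comes from initFromPage; pageBytes is the remaining payload;
  // elementSize is the width of one value (e.g. 4 for FLOAT, 8 for DOUBLE).
  static int boundedElementCount(int valuesCount, int pageBytes, int elementSize) {
    int fromPageSize = pageBytes / elementSize; // what the byte streams actually hold
    return Math.min(fromPageSize, valuesCount); // valuesCount is the hard upper bound
  }

  public static void main(String[] args) {
    // A V1 page where 20 of 100 values are null: the payload holds only the
    // 80 non-null FLOAT values (320 bytes), so we decode 80 elements, not 100.
    System.out.println(boundedElementCount(100, 320, 4)); // prints 80
  }
}
{code}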





> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Affects Versions: format-2.8.0
>Reporter: Xuwei Fu
>Assignee: Gang Wu
>Priority: Major
>
>  
> This problem is shown in this issue:
> https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, but "num_values" is used as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the decoder can know the num_values and num_nulls in the page. However, for a DATA_PAGE_V1 without statistics, we have to read the def-levels and rep-levels to get the real number of values, and without that count we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.
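A minimal sketch of the DATA_PAGE_V1 point above (a hypothetical helper, not parquet-mr API): without page statistics, the number of non-null values has to be recovered from the definition levels before the BYTE_STREAM_SPLIT streams can be sized.

{code:java}
public class NonNullCounter {
  // A value is present exactly when its definition level equals the column's
  // maximum definition level; lower levels denote a null somewhere in the path.
  static int countNonNullValues(int[] defLevels, int maxDefLevel) {
    int nonNull = 0;
    for (int level : defLevels) {
      if (level == maxDefLevel) {
        nonNull++;
      }
    }
    return nonNull;
  }

  public static void main(String[] args) {
    int[] defLevels = {1, 0, 1, 1, 0}; // optional column, maxDefLevel = 1
    System.out.println(countNonNullValues(defLevels, 1)); // 3 decodable values
  }
}
{code}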



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1025: PARQUET-2241: Fix ByteStreamSplitValuesReader with nulls

2023-02-13 Thread via GitHub


wgtmac commented on code in PR #1025:
URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1105198239


##
parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
##
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment:
   Added a check to use `valuesCount` as upper bound.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688227#comment-17688227
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105193381


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   It is tricky to set a max size, whether in terms of the number of files or the total file size. In my experience, performance varies when input files have different numbers of rows, columns, blocks, block sizes, etc. So it would be better for the caller to do the planning.
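A hypothetical caller-side planner along the lines of the comment above (the helper name and the size heuristic are illustrative, not parquet-mr API): group input files into batches by total on-disk size before handing each batch to the rewriter.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RewritePlanner {
  // Greedily packs inputs into batches whose total length stays under maxBatchBytes.
  static List<List<Path>> planBatches(FileSystem fs, List<Path> inputs, long maxBatchBytes)
      throws IOException {
    List<List<Path>> batches = new ArrayList<>();
    List<Path> current = new ArrayList<>();
    long currentBytes = 0;
    for (Path p : inputs) {
      long len = fs.getFileStatus(p).getLen();
      if (!current.isEmpty() && currentBytes + len > maxBatchBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(p);
      currentBytes += len;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
{code}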





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105193381


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   It is tricky to set a max size, whether in terms of the number of files or the total file size. In my experience, performance varies when input files have different numbers of rows, columns, blocks, block sizes, etc. So it would be better for the caller to do the planning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688224#comment-17688224
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105191438


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   > Do we do dedup?
   
   Yes, `allOriginalCreatedBys` is a HashSet which does the job. 
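A standalone illustration of the dedup-and-join behavior described above (not the ParquetRewriter field itself):

{code:java}
import java.util.HashSet;
import java.util.Set;

public class CreatedByDedup {
  public static void main(String[] args) {
    Set<String> allOriginalCreatedBys = new HashSet<>();
    allOriginalCreatedBys.add("parquet-mr version 1.10.1");
    allOriginalCreatedBys.add("parquet-mr version 1.12.3");
    allOriginalCreatedBys.add("parquet-mr version 1.10.1"); // duplicate, dropped by the set
    // Two distinct entries remain, newline-separated, ready for extraMetaData:
    System.out.println(String.join("\n", allOriginalCreatedBys));
  }
}
{code}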





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105191438


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   > Do we do dedup?
   
   Yes, `allOriginalCreatedBys` is a HashSet which does the job. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688043#comment-17688043
 ] 

ASF GitHub Bot commented on PARQUET-758:


julienledem commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1428363226

   @emkornfield apologies for this; I realize I replied to a thread on the private list. @pitrou emailed private@ to get more eyes on this, which worked. But yes, this discussion didn't need to be private.
   I replied: "I would recommend having the corresponding PRs for support in the Java and C++ implementation ready as well for this to be merged so that we don't release a new type that does not have support in the implementation."
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] julienledem commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-02-13 Thread via GitHub


julienledem commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1428363226

   @emkornfield apologies for this; I realize I replied to a thread on the private list. @pitrou emailed private@ to get more eyes on this, which worked. But yes, this discussion didn't need to be private.
   I replied: "I would recommend having the corresponding PRs for support in the Java and C++ implementation ready as well for this to be merged so that we don't release a new type that does not have support in the implementation."
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688030#comment-17688030
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104769279


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   Do we want to set a max size?





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104769279


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   Do we want to set a max size?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688029#comment-17688029
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Do we do dedup? 





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Do we do dedup? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688027#comment-17688027
 ] 

ASF GitHub Bot commented on PARQUET-2241:
-

shangxinli commented on code in PR #1025:
URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1104758035


##
parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
##
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment:
   It seems the input parameter `valuesCount` is not used anymore





> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Affects Versions: format-2.8.0
>Reporter: Xuwei Fu
>Assignee: Gang Wu
>Priority: Major
>
>  
> This problem is shown in this issue:
> https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, but "num_values" is used as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the decoder can know the num_values and num_nulls in the page. However, for a DATA_PAGE_V1 without statistics, we have to read the def-levels and rep-levels to get the real number of values, and without that count we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1025: PARQUET-2241: Fix ByteStreamSplitValuesReader with nulls

2023-02-13 Thread via GitHub


shangxinli commented on code in PR #1025:
URL: https://github.com/apache/parquet-mr/pull/1025#discussion_r1104758035


##
parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java:
##
@@ -58,15 +58,19 @@ protected void gatherElementDataFromStreams(byte[] gatheredData)
   public void initFromPage(int valuesCount, ByteBufferInputStream stream) throws ParquetDecodingException, IOException {

Review Comment:
   It seems the input parameter `valuesCount` is not used anymore



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688001#comment-17688001
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1428175723

   @ggershinsky @shangxinli PTAL, thanks!




> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1428175723

   @ggershinsky @shangxinli PTAL, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2242) record count for row group size check configurable

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687987#comment-17687987
 ] 

ASF GitHub Bot commented on PARQUET-2242:
-

wgtmac commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104614696


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   Why fix this on an old branch but not on master?





> record count for  row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: xjlem
>Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>    if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>      long memSize = columnStore.getBufferedSize();
>      long recordSize = memSize / recordCount;
>      // flush the row group if it is within ~2 records of the limit
>      // it is much better to be slightly under size than to be over at all
>      if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>        LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>        flushRowGroupToStore();
>        initStore();
>        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>        this.lastRowGroupEndPos = parquetFileWriter.getPos();
>      } else {
>        recordCountForNextMemCheck = min(
>            max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>            recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>            );
>        LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>      }
>    }
>  } {code}
> In this code, if the block size is small (for example 8 MB) and the records in the first 100 lines are small while the records after them are large, the result is an oversized row group; in our real scenario it exceeds 64 MB. So I think the record count for the block size check should be configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable

2023-02-13 Thread via GitHub


wgtmac commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104614696


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   Why fix this on an old branch but not on master?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687985#comment-17687985
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104610792


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   We have discussed how to consolidate the original created_by values when merging several input files. I now use a HashSet to make sure there are no duplicates, and the values are concatenated with `\n`. Does this sound good? @gszadovszky





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104610792


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
 }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   We have discussed how to consolidate the original created_by values when merging several input files. I now use a HashSet to make sure there are no duplicates, and the values are concatenated with `\n`. Does this sound good? @gszadovszky



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687977#comment-17687977
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac opened a new pull request, #1026:
URL: https://github.com/apache/parquet-mr/pull/1026

   ### Jira
   
   https://issues.apache.org/jira/browse/PARQUET-2228
   
   ### Tests
   
   - Refactor and add various cases to `ParquetRewriterTest` for merging files.
   
   ### Commits
   
   - RewriteOptions supports merging more than one input file.
   - Enforce schema equality of all input files.
   - ParquetRewriter supports various rewrite operations on several input files.
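A hedged usage sketch of the feature this PR describes, merging several inputs into one output (the builder shape and method names reflect my understanding of the `org.apache.parquet.hadoop.rewrite` package and are assumptions, not quotes from the PR):

{code:java}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class MergeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    List<Path> inputs = Arrays.asList(
        new Path("/data/part-0.parquet"),   // hypothetical paths
        new Path("/data/part-1.parquet"));
    Path output = new Path("/data/merged.parquet");

    // All inputs must share the same schema, per the PR's equality check.
    RewriteOptions options = new RewriteOptions.Builder(conf, inputs, output).build();
    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks(); // copies row groups from every input into the output
    rewriter.close();
  }
}
{code}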




> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] wgtmac opened a new pull request, #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-13 Thread via GitHub


wgtmac opened a new pull request, #1026:
URL: https://github.com/apache/parquet-mr/pull/1026

   ### Jira
   
   https://issues.apache.org/jira/browse/PARQUET-2228
   
   ### Tests
   
   - Refactor and add various cases to `ParquetRewriterTest` for merging files.
   
   ### Commits
   
   - RewriteOptions supports merging more than one input file.
   - Enforce schema equality of all input files.
   - ParquetRewriter supports various rewrite operations on several input files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687886#comment-17687886
 ] 

ASF GitHub Bot commented on PARQUET-758:


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1427745167

   > It might have missed it but I didn't see Julien's reply on the dev mailing 
list. This seems reasonable though.
   
   For full disclosure, it was a discussion involving the Parquet PMC and 
myself after I messaged the PMC to inquire about the way forward on this 
proposal.
   




> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] pitrou commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-02-13 Thread via GitHub


pitrou commented on PR #184:
URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1427745167

   > It might have missed it but I didn't see Julien's reply on the dev mailing 
list. This seems reasonable though.
   
   For full disclosure, it was a discussion involving the Parquet PMC and 
myself after I messaged the PMC to inquire about the way forward on this 
proposal.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2242) record count for row group size check configurable

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687809#comment-17687809
 ] 

ASF GitHub Bot commented on PARQUET-2242:
-

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104154913


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
         flushRowGroupToStore();
         initStore();
-        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+        recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);

Review Comment:
   Yes, it's like the existing configs 'parquet.page.size.row.check.min' and 'parquet.page.size.row.check.max'.
   The row group check algorithm is like the page check algorithm with the config 'parquet.page.size.check.estimate' set to true.





> record count for  row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: xjlem
>Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>    if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>      long memSize = columnStore.getBufferedSize();
>      long recordSize = memSize / recordCount;
>      // flush the row group if it is within ~2 records of the limit
>      // it is much better to be slightly under size than to be over at all
>      if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>        LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>        flushRowGroupToStore();
>        initStore();
>        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>        this.lastRowGroupEndPos = parquetFileWriter.getPos();
>      } else {
>        recordCountForNextMemCheck = min(
>            max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>            recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>            );
>        LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>      }
>    }
>  } {code}
> In this code, if the block size is small (for example 8 MB) and the records in the first 100 lines are small while the records after them are large, the result is an oversized row group; in our real scenario it exceeds 64 MB. So I think the record count for the block size check should be configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] xjlem commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable

2023-02-13 Thread via GitHub


xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104154913


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java:
##
@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
         LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
         flushRowGroupToStore();
         initStore();
-        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+        recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);

Review Comment:
   Yes, it's like the existing configs 'parquet.page.size.row.check.min' and 'parquet.page.size.row.check.max'.
   The row group check algorithm is like the page check algorithm with the config 'parquet.page.size.check.estimate' set to true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2242) record count for row group size check configurable

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687805#comment-17687805
 ] 

ASF GitHub Bot commented on PARQUET-2242:
-

xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104147184


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

Review Comment:
   
[apache:parquet-1.10.x](https://github.com/apache/parquet-mr/tree/parquet-1.10.x) doesn't have a README.md in the parquet-hadoop directory, so I have now added a README.md on this branch.





> record count for  row group size check configurable
> ---
>
> Key: PARQUET-2242
> URL: https://issues.apache.org/jira/browse/PARQUET-2242
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: xjlem
>Priority: Major
>
>  org.apache.parquet.hadoop.InternalParquetRecordWriter#checkBlockSizeReached
> {code:java}
>  private void checkBlockSizeReached() throws IOException {
>    if (recordCount >= recordCountForNextMemCheck) { // checking the memory size is relatively expensive, so let's not do it for every record.
>      long memSize = columnStore.getBufferedSize();
>      long recordSize = memSize / recordCount;
>      // flush the row group if it is within ~2 records of the limit
>      // it is much better to be slightly under size than to be over at all
>      if (memSize > (nextRowGroupSize - 2 * recordSize)) {
>        LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
>        flushRowGroupToStore();
>        initStore();
>        recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
>        this.lastRowGroupEndPos = parquetFileWriter.getPos();
>      } else {
>        recordCountForNextMemCheck = min(
>            max(MINIMUM_RECORD_COUNT_FOR_CHECK, (recordCount + (long)(nextRowGroupSize / ((float)recordSize))) / 2), // will check halfway
>            recordCount + MAXIMUM_RECORD_COUNT_FOR_CHECK // will not look more than max records ahead
>            );
>        LOG.debug("Checked mem at {} will check again at: {}", recordCount, recordCountForNextMemCheck);
>      }
>    }
>  } {code}
> In this code, if the block size is small (for example 8 MB) and the records in the first 100 lines are small while the records after them are large, the result is an oversized row group; in our real scenario it exceeds 64 MB. So I think the record count for the block size check should be configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] xjlem commented on a diff in pull request #1024: PARQUET-2242:record count for row group size check configurable

2023-02-13 Thread via GitHub


xjlem commented on code in PR #1024:
URL: https://github.com/apache/parquet-mr/pull/1024#discussion_r1104147184


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##
@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
   public static final String MAX_PADDING_BYTES= 
"parquet.writer.max-padding";
   public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = 
"parquet.page.size.row.check.min";
   public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = 
"parquet.page.size.row.check.max";
+  public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = 
"parquet.block.size.row.check.min";

Review Comment:
   
[apache:parquet-1.10.x](https://github.com/apache/parquet-mr/tree/parquet-1.10.x) doesn't have a README.md in the parquet-hadoop directory, so I have now added a README.md on this branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org