[jira] [Commented] (PARQUET-2134) Incorrect type checking in HadoopStreams.wrap

2022-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511570#comment-17511570
 ] 

ASF GitHub Bot commented on PARQUET-2134:
-----------------------------------------

7c00 commented on a change in pull request #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r833885795



##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java
##########
@@ -66,6 +67,15 @@ public static SeekableInputStream wrap(FSDataInputStream stream) {
     }
   }
 
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
+    InputStream wrapped = stream.getWrappedStream();
+    if (wrapped instanceof FSDataInputStream) {
+      return isWrappedStreamByteBufferReadable(((FSDataInputStream) wrapped));

Review comment:
   Good suggestions! I have updated the code as suggested in the comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Incorrect type checking in HadoopStreams.wrap
> ---------------------------------------------
>
> Key: PARQUET-2134
> URL: https://issues.apache.org/jira/browse/PARQUET-2134
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.3, 1.10.1, 1.11.2, 1.12.2
>Reporter: Todd Gao
>Priority: Minor
>
> The method 
> [HadoopStreams.wrap|https://github.com/apache/parquet-mr/blob/4d062dc37577e719dcecc666f8e837843e44a9be/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L51]
> wraps an FSDataInputStream into a SeekableInputStream.
> It checks whether the underlying stream of the passed FSDataInputStream
> implements ByteBufferReadable: if so, it wraps the FSDataInputStream in an
> H2SeekableInputStream; otherwise, in an H1SeekableInputStream.
> In some cases, we may add another wrapper over FSDataInputStream. For
> example:
> {code:java}
> class CustomDataInputStream extends FSDataInputStream {
>     public CustomDataInputStream(FSDataInputStream original) {
>         super(original);
>     }
> }
> {code}
> Suppose we create an FSDataInputStream whose underlying stream does not
> implement ByteBufferReadable, and then wrap it in a CustomDataInputStream.
> If we use HadoopStreams.wrap to create a SeekableInputStream, we may get an
> error like
> {quote}java.lang.UnsupportedOperationException: Byte-buffer read unsupported
> by input stream{quote}
> We can fix this by recursively checking the underlying stream of the
> FSDataInputStream.
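
The recursive check described above can be sketched in plain Java. This is a minimal, self-contained illustration: `FSDataInputStreamStub` and `ByteBufferReadableStub` are stand-ins invented here for Hadoop's `FSDataInputStream` and `ByteBufferReadable` interfaces, not the actual parquet-mr implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// Stand-ins for Hadoop's types, for illustration only. In parquet-mr the real
// check would use org.apache.hadoop.fs.FSDataInputStream and
// org.apache.hadoop.fs.ByteBufferReadable.
interface ByteBufferReadable {}

class FSDataInputStreamStub extends FilterInputStream {
  FSDataInputStreamStub(InputStream in) { super(in); }
  InputStream getWrappedStream() { return in; }
}

class ByteBufferReadableStub extends ByteArrayInputStream implements ByteBufferReadable {
  ByteBufferReadableStub() { super(new byte[0]); }
}

public class WrapCheck {
  // Recursively unwrap nested FSDataInputStream layers before testing the
  // innermost stream for ByteBufferReadable, as the proposed fix describes.
  static boolean isWrappedStreamByteBufferReadable(FSDataInputStreamStub stream) {
    InputStream wrapped = stream.getWrappedStream();
    if (wrapped instanceof FSDataInputStreamStub) {
      return isWrappedStreamByteBufferReadable((FSDataInputStreamStub) wrapped);
    }
    return wrapped instanceof ByteBufferReadable;
  }

  public static void main(String[] args) {
    // Double-wrapped stream whose innermost stream is NOT ByteBufferReadable:
    FSDataInputStreamStub plain = new FSDataInputStreamStub(
        new FSDataInputStreamStub(new ByteArrayInputStream(new byte[0])));
    System.out.println(isWrappedStreamByteBufferReadable(plain));    // false

    // Double-wrapped stream whose innermost stream IS ByteBufferReadable:
    FSDataInputStreamStub readable = new FSDataInputStreamStub(
        new FSDataInputStreamStub(new ByteBufferReadableStub()));
    System.out.println(isWrappedStreamByteBufferReadable(readable)); // true
  }
}
```

A non-recursive check (a single `instanceof` on the immediate wrapped stream) would misclassify the double-wrapped case, which is exactly the bug reported here.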



--
This message was sent by Atlassian Jira
(v8.20.1#820001)



Parquet sync meeting notes 3/23/2022

2022-03-23 Thread Xinli shang
Attendees: Jorge (Munin Data), Gidon, Huaxin, Vinoo, Xinli

1. Cell level encryption
   1. Formal design is sent out
   2. We chose the 2nd option, splitting columns, because it doesn't need a
      specification change
   3. Implementation is in progress
   4. Create a feature branch for review
2. Column resolution by ID (PR)
   1. The 'field_id' in the schema is used
   2. Uniqueness might not be guaranteed when columns are resolved by ID; we
      might need a place to record a flag that this Parquet file is
      column-ID resolvable
   3. The concat tool might be a problem
   4. Need some help from Iceberg
3. Parquet writer for Iceberg (adding a new constructor)
   1. A diff will be sent out soon
4. New website (link)
   1. Looks good; will make it formal

-- 
Xinli Shang
VP Apache Parquet PMC Chair, Tech Lead Manager at Uber Data Infra


[GitHub] [parquet-site] vinooganesh opened a new pull request #15: Add Release docs and new GCS engine id

2022-03-23 Thread GitBox


vinooganesh opened a new pull request #15:
URL: https://github.com/apache/parquet-site/pull/15


   Add documentation on how to release and new GCS engine id






[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511344#comment-17511344
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-----------------------------------------

shangxinli commented on pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#issuecomment-1076485916


   > hi @huaxingao , can you describe the lifecycle of the column IDs at a high 
level, either in the PR description, or in a comment? Where these IDs are 
stored (if in footer - which struct/field)? How are they set and written? Is 
the writer app expected to verify the uniqueness, or it can use this PR code 
for that? How the column IDs are read and used (is the reader app expected to 
do anything beyond using this PR code)? I think the answer to the last question 
is mostly provided, but it doesn't explicitly say what IDs are used (where they 
are stored / read from).
   
   +1, I think we can add it to the design doc




> Column resolution by ID
> -----------------------
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet resolves columns by name. In many usages, e.g. schema resolution,
> this can be a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira is to add column ID resolution support.
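
The idea behind ID-based resolution can be sketched in plain Java. This is a hypothetical illustration using ordinary maps, not the PR's actual code: in real Parquet files the per-column `field_id` lives in the schema metadata, and the names and class below are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of ID-based column resolution. Each file records a
// field_id per column; a reader resolves a requested ID to whatever name
// that column carries in this particular file, so renames across files
// (e.g. via Iceberg schema evolution) do not break resolution.
public class ColumnIdResolution {
  static String resolveById(Map<Integer, String> fileIdToName, int fieldId) {
    String name = fileIdToName.get(fieldId);
    if (name == null) {
      throw new IllegalArgumentException("No column with field_id " + fieldId);
    }
    return name;
  }

  public static void main(String[] args) {
    // An older file wrote the column as "user_id"; a newer one renamed it "uid".
    Map<Integer, String> oldFile = new LinkedHashMap<>();
    oldFile.put(1, "user_id");
    Map<Integer, String> newFile = new LinkedHashMap<>();
    newFile.put(1, "uid");

    // The same field_id resolves correctly against both files, where
    // name-based resolution would fail on one of them.
    System.out.println(resolveById(oldFile, 1)); // user_id
    System.out.println(resolveById(newFile, 1)); // uid
  }
}
```

Note the uniqueness caveat from the discussion above: this scheme only works if field IDs are unique within a file, which is why a per-file "column-ID resolvable" flag was raised as a possible requirement.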



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

