zabetak commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1441842189

   I am in favor of enforcing limits on the writer, but I would expect this 
to happen in the protobuf layer 
(https://github.com/protocolbuffers/protobuf/issues/11729). The limitation 
comes from protobuf, so it seems natural to add the checks there rather than 
in ORC code.
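
   As a rough illustration of what such a writer-side check could look like, 
here is a minimal sketch. The names `check_metadata_size`, 
`MetadataTooLargeError` and `PROTOBUF_MAX_MESSAGE_BYTES` are hypothetical and 
are not part of the ORC or protobuf APIs; the default limit assumes the 2 GiB 
cap on a serialized protobuf message.

```python
# Hypothetical writer-side guard; none of these names exist in ORC or protobuf.
# Assumption: a serialized protobuf message must stay below 2 GiB.
PROTOBUF_MAX_MESSAGE_BYTES = 2**31 - 1


class MetadataTooLargeError(ValueError):
    """Raised when serialized metadata exceeds the configured limit."""


def check_metadata_size(serialized: bytes,
                        limit: int = PROTOBUF_MAX_MESSAGE_BYTES) -> int:
    """Return the size of the serialized metadata, or fail fast if too big."""
    size = len(serialized)
    if size > limit:
        raise MetadataTooLargeError(
            f"metadata is {size} bytes, which exceeds the limit of {limit} bytes"
        )
    return size
```

   Failing at write time would turn an unreadable file into an immediate, 
explainable error, wherever the check ends up living.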
   
   I brought up the question of maximum size because the metadata grows 
together with the file, and here we clearly have a hard limit on how far it 
can go. If there is a compelling use-case for supporting arbitrarily big 
files (with arbitrarily big metadata), then investing in a new design would 
make sense.
   
   To be clear, I am not pushing for a new design, since I fully agree with both 
@deshanxiao and @omalley that with proper schema design and configuration the 
chances of hitting the problem are rather small. From the ORC perspective, it 
seems acceptable to settle for the 1/2GB limit on the metadata section.
   
   I didn't mean to imply that having 500GB is good or normal, but if 
nothing prevents users from doing it, eventually they will get there. :)
   
   Speaking of actual use-cases: in Hive, I recently saw `OrcSplit` 
reporting a 
[fileLength](https://github.com/apache/hive/blob/d079dfbdda61f6e24fa39ecbcfb758f0a7402cf3/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L69)
 of `216488139040` for files under certain table partitions. I am not sure how 
well this number reflects the actual file size, nor what configuration led to 
this situation, since I am not the end user myself; I was only investigating 
the problem through the Hive application logs.
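
   For a sense of scale, assuming the reported value is a length in bytes 
(which I have not verified), a quick conversion looks like this. The helper 
`describe_bytes` is only an illustrative name.

```python
def describe_bytes(n: int) -> str:
    """Render a byte count in decimal gigabytes and binary gibibytes."""
    gb = n / 10**9    # decimal gigabytes
    gib = n / 2**30   # binary gibibytes
    return f"{n} bytes = {gb:.2f} GB = {gib:.2f} GiB"


print(describe_bytes(216488139040))
# prints: 216488139040 bytes = 216.49 GB = 201.62 GiB
```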
   
   Summing up, I don't think a new metadata design is worth it at the moment, 
and limiting the writer seems better done in the protobuf layer. From my 
perspective, the only actionable item for this issue at this point would be 
to add a brief mention of the metadata size limitation to the website (e.g., 
https://orc.apache.org/specification/ORCv1/).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
