shunping-google commented on PR #28728:
URL: https://github.com/apache/beam/pull/28728#issuecomment-1787219466

   [gudfhr95](https://github.com/gudfhr95),
   
   Thanks again for submitting your PRs for review! I spent some time 
discussing with the previous reviewer (@damondouglas) about this PR 
(https://github.com/apache/beam/pull/28728) offline and below is our opinion.
   
   While there are merits in your changes, we also see certain problems that we 
will have to address:
   * It modifies the signature of the public function readFullyAsBytes(), which 
will inevitably break the backward compatibility of existing code. This is 
something we want to avoid.
   * It may not work on large files since it has to put all file content in the 
memory.
   * It only considers "\n" as the line delimiter, while there could be more 
out there.
   
   In fact, I think FileIO is not the best place for this change. In the design 
of Beam IO, FileIO should be mainly focusing on getting a readable/writable 
channel ready. It should not have any assumption on what's inside the file. If 
you put the header skipping feature in FileIO, you are assuming it is a 
text-like file that has line breaks. 
   
   We prefer putting the change in a class derived from FileIO. Just like your 
original PR (https://github.com/apache/beam/pull/28502), TextIO will be the 
best choice as it is for text files and it has already handled different types 
of delimiters. However, this PR also introduced some breaking changes to the 
public functions. Moreover, it will not scale well because it needs to disable 
isSplittable() when the header skipping feature is used.
   
   In summary, we see problems in both PRs and it will be great if we can 
   
   1. implement the changes in TextIO,
   
   1. keep the capability of parallel processing for a large file, and
   
   1. make sure there no breaking changes (instead of changing the signature of 
a public function, try to overload it).
   
   There is another PR (https://github.com/apache/beam/pull/29202) submitted 
recently on implementing this feature. You can refer to it for some ideas.
   
   Thanks,
   
   Shunping
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to