[ 
https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141566#comment-16141566
 ] 

Etienne Chauchot edited comment on BEAM-2802 at 8/25/17 12:51 PM:
------------------------------------------------------------------

I have read the PR above and I agree with your comments  in particular the ones 
on performance and maintenance (complex generic code). Comparing to the above 
PR what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over 
the bytes of the custom delimiter if it is set. If it is not set, the current 
code (\r\n) is called.
- maintenance: I decided not to make the code generic to support parsing both 
{{\r\n}} and the custom delimiter but rather to keep the code path as simple as 
possible with simple {{if (customDelimiter != null)}} (see above)
- no {{Set}} of delimiters but rather a {{byte[]}} delimiter that is either set 
or not set. 
- there is only one delimiter at a time, either {{\r\n}} or custom delimiter 
because we do not want to split twice if there is both new lines and custom 
separator in the text file in particular if a record is spread across more than 
one line

One could argue that the custom delimiter is not needed because we could split 
using new lines and then use a {{ParDo}} to split once again using the custom 
delimiter. But if a record is spread between multi-line, then this approach 
will generate 2 records in the output Collection.
 



was (Author: echauchot):
I have read the PR above and I agree with your comments  in particular the ones 
on performance and maintenance (complex generic code). Comparing to the above 
PR what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over 
the bytes of the custom delimiter if it is set. If it is not set, the current 
code (\r\n) is called.
- maintenance: I decided not to make the code generic to support parsing both 
{{\r\n}} and the custom delimiter but rather to keep the code path as simple as 
possible with simple {{if (customDelimiter != null)}} (see above)
- no set of delimiters but rather a {{byte[]}} delimiter that is either set or 
not set. 
- there is only one delimiter at a time, either {{\r\n}} or custom delimiter 
because we do not want to split twice if there is both new lines and custom 
separator in the text file in particular if a record is spread across more than 
one line

One could argue that the custom delimiter is not needed because we could split 
using new lines and then use a {{ParDo}} to split once again using the custom 
delimiter. But if a record is spread between multi-line, then this approach 
will generate 2 records in the output Collection.
 


> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO use {{\r}} {{\n}} or {{\r\n}} or a mix of the two to split a 
> text file into PCollection elements. It might happen that a record is spread 
> across more than one line. In that case we should be able to specify a custom 
> record delimiter to be used in place of the default ones.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to