[
https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141566#comment-16141566
]
Etienne Chauchot commented on BEAM-2802:
----------------------------------------
I have read the PR above and I agree with your comments in particular the ones
on performance and maintenance (complex generic code). Comparing to the above
PR what I have started coding:
- performance: I only add one loop in {{findSeparatorBounds()}} to iterate over
the bytes of the custom delimiter if it is set. If it is not set, the current
code (\r\n) is called.
- maintenance: I decided not to make the code generic to support parsing both
{{\r\n}} and the custom delimiter but rather to keep the code path as simple as
possible with simple {{if (customDelimiter != null)}} (see above)
- no set of delimiters but rather a {{byte[]}} delimiter that is either set or
not set.
- there is only one delimiter at a time, either {{\r\n}} or custom delimiter
because we do not want to split twice if there is both new lines and custom
separator in the text file in particular if a record is spread across more than
one line
One could argue that the custom delimiter is not needed because we could split
using new lines and then use a {{ParDo}} to split once again using the custom
delimiter. But if a record is spread between multi-line, then this approach
will generate 2 records in the output Collection.
> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
> Key: BEAM-2802
> URL: https://issues.apache.org/jira/browse/BEAM-2802
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
> Priority: Minor
>
> Currently TextIO use {{\r}} {{\n}} or {{\r\n}} or a mix of the two to split a
> text file into PCollection elements. It might happen that a record is spread
> across more than one line. In that case we should be able to specify a custom
> record delimiter to be used in place of the default ones.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)