Romster commented on a change in pull request #13513:
URL: https://github.com/apache/beam/pull/13513#discussion_r544121675



##########
File path: sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlSourceTest.java
##########
@@ -873,6 +881,46 @@ public void testSplitAtFractionExhaustiveSingleByte() throws Exception {
     assertSplitAtFractionExhaustive(source, options);
   }
 
+  @Test
+  public void testNoBufferOverflowThrown() throws IOException {
+    // The magicNumber was found imperatively and will be different for different xml content.
+    // Test with the current setup causes BufferOverflow in
+    // XMLReader#getFirstOccurenceOfRecordElement method,
+    // if the specific corner case is not handled
+    final int magicNumber = 183;
+    StringBuilder sb = new StringBuilder();

Review comment:
   Here is a fragment of the XML we process in our service.
   We read `Product` records from the root element `ONIXMessage`:
   ```
   <?xml version='1.0' encoding='ISO-8859-1'?>
   <ONIXMessage xmlns="http://ns.editeur.org/onix/3.0/reference" release="3.0">
     <Header>
       <Sender>
         <SenderIdentifier>
           <SenderIDType></SenderIDType>
           <IDValue></IDValue>
         </SenderIdentifier>
         <SenderName></SenderName>
         <EmailAddress></EmailAddress>
       </Sender>
       <Addressee>
         <AddresseeName></AddresseeName>
       </Addressee>
       <MessageNumber></MessageNumber>
       <SentDateTime></SentDateTime>
     </Header>
     <Product>
       <RecordReference></RecordReference>
       <NotificationType></NotificationType>
       <RecordSourceType></RecordSourceType>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
       <ProductIdentifier>
         <ProductIDType></ProductIDType>
         <IDValue></IDValue>
       </ProductIdentifier>
   ```
   As you can see, there are inner tags such as `ProductIdentifier` and `ProductIDType`, whose names start with the record element name `Product`.
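
To illustrate why such inner tags can matter to a byte-level scanner, here is a minimal sketch. It is not Beam's actual implementation; the class `RecordElementScan` and the method `firstRecordStart` are hypothetical names made up for this example. It shows that a plain prefix search for `<Product` also hits `<ProductIdentifier`, so a scanner has to check the character that follows the element name:

```java
import java.nio.charset.StandardCharsets;

public class RecordElementScan {

    /**
     * Returns the byte offset of the first "<name" that is followed by
     * whitespace, '>' or '/', or -1 if no such occurrence exists.
     * Hypothetical helper for illustration only, not part of Beam.
     */
    public static int firstRecordStart(byte[] data, String name) {
        byte[] needle = ("<" + name).getBytes(StandardCharsets.ISO_8859_1);
        // Stop early enough that the byte after the needle is always in range.
        for (int i = 0; i + needle.length < data.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (!match) {
                continue;
            }
            // Reject longer names such as <ProductIdentifier>.
            char next = (char) (data[i + needle.length] & 0xFF);
            if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String xml = "<ONIXMessage><ProductIdentifier/>"
            + "<Product><RecordReference/></Product></ONIXMessage>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("naive prefix search: " + xml.indexOf("<Product"));
        System.out.println("delimiter-aware scan: " + firstRecordStart(bytes, "Product"));
    }
}
```

In this made-up input the naive search stops at `<ProductIdentifier/>`, while the delimiter-aware scan skips it and returns the offset of the real `<Product>` record.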
   But we also have files with a flat structure like
   ```
   <root>
   <head>
   </head>
   <record>
   </record>
   ...
   </root>
   ```
   and _in most cases_ this is not a problem.
   We run the pipeline on Google Dataflow, and when it fails with BufferOverflow, restarting the job helps, so the issue can't be easily reproduced.
