vatanrathi opened a new issue, #29146:
URL: https://github.com/apache/beam/issues/29146

   ### What happened?
   
   Beam TextIO reader reads same record twice for large (~45GB) delimited 
(.DAT) files . No issues encountered when same file was splitted into smaller 
files and processed together. 
   **This behaviour is not intermittent or random. Same records appears twice 
in every run.**
   
   
   ```
   wc -l test_tx_20211101.dat
   78132206 test_tx_20211101.dat 
   ```
   
   record count of **78132206** was also verified loading file in spark which 
gives same count
   Note: records are delimited  with CRLF (^M$)
   
   
   pipeline to read file:
   ```
   input.apply("Read lines", TextIO.read().from(<s3 file path)
   .apply("timestamp", WithTimestamp.of(e -> Instant.now()))
   .setCoder(StringUtf8Coder.of())
   .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
   .apply(ParDo.of(new LogOutput<>("Pcollection count: ")))
   ```
   
   
   Same code was executed with beam 2.23.0 + aws sdk1 & beam 2.48.0 + aws sdk2
   
   beam 2.23.0 + aws sdk1 --> Pcollection count: **78132206**
   beam 2.48.0 + aws sdk2 --> Pcollection count: **78132208**
   
   To eliminate the possibility of a record being corrupted , I deleted first 
(of 2) record which appeared twice.. Now the next record to it appeared twice 
and second duplicate record also shift down to next one .. this implies that 
something went wrong while read at certain byte offset
   
   **Scenerio **  records at line # 33554433 & 67108865 appeared twice
   Line#              item
   1         
   .
   .
   **33554433   abc**
   .
   .
   .
   **67108865  xyz**
   .
   .
   78132206
   
   In second test, when record at line # 33554433 was deleted,  now the records 
at 33554433 (which was at line# 33554434)  and 67108865 (which was at line# 
67108866) were read twice 
   
   It seems records are read successfully until records# 33554432 and then 
record#33554433 read twice
   Note: 67108865 - 33554433 = 33554432
   
   Record#33554433 --> Line 33554433 of  78132206 ; Word 871941986 of 
2028601968; Byte 21374174185 of 49770215222
   Record#67108865 --> Line 67108865 of  78132206 ; Word 1743499355 of 
2028601968; Byte 42748346369 of 49770215222
   
   
   
   
   
   
   
   
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [X] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to