vatanrathi opened a new issue, #29146:
URL: https://github.com/apache/beam/issues/29146
### What happened?
Beam TextIO reader reads same record twice for large (~45GB) delimited
(.DAT) files . No issues encountered when same file was splitted into smaller
files and processed together.
**This behaviour is not intermittent or random. Same records appears twice
in every run.**
```
wc -l test_tx_20211101.dat
78132206 test_tx_20211101.dat
```
record count of **78132206** was also verified loading file in spark which
gives same count
Note: records are delimited with CRLF (^M$)
pipeline to read file:
```
input.apply("Read lines", TextIO.read().from(<s3 file path)
.apply("timestamp", WithTimestamp.of(e -> Instant.now()))
.setCoder(StringUtf8Coder.of())
.apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
.apply(ParDo.of(new LogOutput<>("Pcollection count: ")))
```
Same code was executed with beam 2.23.0 + aws sdk1 & beam 2.48.0 + aws sdk2
beam 2.23.0 + aws sdk1 --> Pcollection count: **78132206**
beam 2.48.0 + aws sdk2 --> Pcollection count: **78132208**
To eliminate the possibility of a record being corrupted , I deleted first
(of 2) record which appeared twice.. Now the next record to it appeared twice
and second duplicate record also shift down to next one .. this implies that
something went wrong while read at certain byte offset
**Scenerio ** records at line # 33554433 & 67108865 appeared twice
Line# item
1
.
.
**33554433 abc**
.
.
.
**67108865 xyz**
.
.
78132206
In second test, when record at line # 33554433 was deleted, now the records
at 33554433 (which was at line# 33554434) and 67108865 (which was at line#
67108866) were read twice
It seems records are read successfully until records# 33554432 and then
record#33554433 read twice
Note: 67108865 - 33554433 = 33554432
Record#33554433 --> Line 33554433 of 78132206 ; Word 871941986 of
2028601968; Byte 21374174185 of 49770215222
Record#67108865 --> Line 67108865 of 78132206 ; Word 1743499355 of
2028601968; Byte 42748346369 of 49770215222
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [X] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]