damccorm opened a new issue, #20585: URL: https://github.com/apache/beam/issues/20585
I am attempting to parse a very large CSV file (65 million lines) with Beam (version 2.25) from an Azure Blob and have created a pipeline for this. I am running the pipeline on Dataflow and testing with a smaller version of the file (10,000 lines). I am using FileIO and the filesystem prefix "azfs" to read from Azure blobs.

The pipeline works with the small test file, but when I run it on the full file I get a "Stream Mark Expired" exception (pasted below). Reading the same file from a GCP bucket works just fine, including when running with Dataflow.

The CSV file I am attempting to ingest is 54.2 GB and can be obtained here: https://obis.org/manual/access/

Imported from Jira [BEAM-11313](https://issues.apache.org/jira/browse/BEAM-11313). Original Jira may contain additional context. Reported by: [email protected].
