stevedlawrence commented on pull request #472: URL: https://github.com/apache/incubator-daffodil/pull/472#issuecomment-760374203
I have done that, and done so with a multi-gig file, so that part does work. The problem was that the schema I used for it was just too simple and required zero backtracking. We were throwing away buckets earlier than we should have, but it didn't matter since we never used those buckets.

This particular CSV schema, while it doesn't require much backtracking, does require a little bit of lookahead when scanning for delimiters. I think what happened in this case is that we scanned for a delimiter, were overzealous in getting rid of the previous bucket in doing so, and then when the scanner came back to read the data, the bucket containing that data was gone.

What we probably really need is a test that consumes a bunch of data but has a speculative parse that backtracks just a little less than 256MB. That huge backtrack should be allowed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
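The failure mode described in the comment above can be sketched in miniature. This is a hypothetical illustration, not Daffodil's actual buffering code: the `BucketedStream` class, its method names, and the tiny bucket size are all invented for the example. The point it demonstrates is the same, though: a discard pass that only looks at the current scan position, and ignores the mark held by a pending speculative parse, frees buckets that a later backtrack still needs.

```python
# Hypothetical sketch of the eager bucket-discard bug: all names and sizes
# here are invented for illustration; Daffodil's real buffering differs.

class BucketedStream:
    def __init__(self, data, bucket_size=4):
        self.bucket_size = bucket_size
        n = (len(data) + bucket_size - 1) // bucket_size
        # Data held in fixed-size buckets; discarded buckets are removed.
        self.buckets = {i: data[i * bucket_size:(i + 1) * bucket_size]
                        for i in range(n)}
        self.pos = 0
        self.mark = 0  # earliest position a backtrack may return to

    def read(self):
        bucket = self.buckets.get(self.pos // self.bucket_size)
        if bucket is None:
            raise RuntimeError("bucket already discarded")
        byte = bucket[self.pos % self.bucket_size]
        self.pos += 1
        return byte

    def scan_for(self, delim):
        # Lookahead: advance past bytes until the delimiter is consumed.
        while self.read() != delim:
            pass

    def discard(self, overzealous):
        # Overzealous: frees everything behind the current scan position,
        # ignoring the mark. Correct: never frees past the earlier of the
        # scan position and the mark.
        limit = self.pos if overzealous else min(self.pos, self.mark)
        for idx in list(self.buckets):
            if (idx + 1) * self.bucket_size <= limit:
                del self.buckets[idx]

    def backtrack(self):
        self.pos = self.mark


def demo(overzealous):
    s = BucketedStream(b"field1,field2," * 4)
    s.mark = 0            # speculative parse begins at offset 0
    s.scan_for(ord(","))  # delimiter scan reads ahead past several buckets
    s.scan_for(ord(","))
    s.discard(overzealous)
    s.backtrack()         # scanner comes back to re-read the data
    try:
        return bytes(s.read() for _ in range(6))
    except RuntimeError as e:
        return str(e)

print(demo(overzealous=True))   # the bug: data behind the mark is gone
print(demo(overzealous=False))  # respecting the mark keeps it readable
```

With the overzealous discard, the backtrack lands in a freed bucket and the read fails; with the mark-aware discard, the same backtrack succeeds. A real regression test along the lines suggested above would do this at scale, with a speculative parse whose backtrack distance sits just under the 256MB boundary.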
