[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856136#comment-13856136 ]

Deepak Kumar V commented on HADOOP-9307:

Doug pointed me here. I see a similar error while reading an Avro file and doing a random number of seeks. Details:

Hello,

I have a 340 MB Avro data file that contains records sorted and identified by a unique id (duplicate records exist). At the beginning of every unique record, a synchronization point is created with DataFileWriter.sync(). (I cannot, or do not want to, save the sync points, and I do not want to use SortedKeyValueFile as the output format for the M/R job.) There are at least 25k synchronization points in the 340 MB file.

Ex: Marker1_RecordA1_RecordA2_RecordA3_Marker2_RecordB1_RecordB2

As the records are sorted, a binary search is performed for efficient retrieval, using the attached code. Most of the time the search is successful, but at times the code throws the following exception:

org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210

Questions:
1) Is it OK to have 25k sync points in a 300 MB file? Does it cost performance while reading?
2) I note down the position that was used to invoke fileReader.sync(mid). If I catch the AvroRuntimeException, close and reopen the file, and call sync(mid) again, I do not see the exception. Why does Avro throw the exception the first time and not later?
3) Is there a limit on the number of times sync() can be invoked?
4) When sync(position) is invoked, is any position with 0 <= position <= file.size() valid? If yes, why do I see the AvroRuntimeException (#2)?

Some of these questions are irrelevant here. As the patch has been committed, what version of hadoop-core will have this fix?
BufferedFSInputStream.read returns wrong results after certain seeks

Key: HADOOP-9307
URL: https://issues.apache.org/jira/browse/HADOOP-9307
Project: Hadoop Common
Issue Type: Bug
Components: fs
Affects Versions: 1.1.1, 2.0.2-alpha
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Fix For: 3.0.0, 2.1.0-beta, 1.3.0
Attachments: hadoop-9307-branch-1.txt, hadoop-9307.txt

After certain sequences of seek/read, BufferedFSInputStream can silently return data from the wrong part of the file. Further description in first comment below.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856137#comment-13856137 ]

Deepak Kumar V commented on HADOOP-9307:

I am using hadoop-core-1.1.2.21
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13856168#comment-13856168 ]

Harsh J commented on HADOOP-9307:

Hello Deepak,

The Fix Version field lists 2.1.0-beta (and onwards) for Hadoop 2.x, and 1.3.0 (and onwards) for Hadoop 1.x.

--
Harsh J
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658229#comment-13658229 ]

Hudson commented on HADOOP-9307:

Integrated in Hadoop-Yarn-trunk #210 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/210/])
HADOOP-9307. BufferedFSInputStream.read returns wrong results after certain seeks. Contributed by Todd Lipcon. (Revision 1482377)

Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1482377
Files :
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalFileSystem.java
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658338#comment-13658338 ]

Hudson commented on HADOOP-9307:

Integrated in Hadoop-Hdfs-trunk #1399 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1399/])
HADOOP-9307. BufferedFSInputStream.read returns wrong results after certain seeks. Contributed by Todd Lipcon. (Revision 1482377)

Result = FAILURE
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1482377
Files :
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalFileSystem.java
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658374#comment-13658374 ]

Hudson commented on HADOOP-9307:

Integrated in Hadoop-Mapreduce-trunk #1426 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1426/])
HADOOP-9307. BufferedFSInputStream.read returns wrong results after certain seeks. Contributed by Todd Lipcon. (Revision 1482377)

Result = SUCCESS
todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1482377
Files :
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestLocalFileSystem.java
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656861#comment-13656861 ]

Todd Lipcon commented on HADOOP-9307:

Hey Steve. I agree that improving the general cross-filesystem testing is a worthy goal. But, this is a simple bug in an existing implementation, and the patch adds a specific unit test. Given that this breaks HBase running on the local filesystem, I don't think it makes sense to block fixing it on a much bigger project like standardizing tests.
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13656872#comment-13656872 ]

Harsh J commented on HADOOP-9307:

+1 - The change and the added regression test look good. I tested it without the fix as well. Nice find Todd!
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627129#comment-13627129 ]

Todd Lipcon commented on HADOOP-9307:

[~ste...@apache.org], mind taking a look?
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624108#comment-13624108 ]

Hadoop QA commented on HADOOP-9307:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12577294/hadoop-9307.txt
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/2421//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/2421//console

This message is automatically generated.
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578244#comment-13578244 ]

Todd Lipcon commented on HADOOP-9307:

An example sequence of seeks which returns the wrong data is as follows, assuming a 4096-byte buffer:
{code}
seek(0);
readFully(1);
{code}
This primes the buffer. After this, the current state of the buffered stream is {{pos=1, count=4096, filepos=4096}}.
{code}
seek(2000);
{code}
The seek sees that the required data is already in the buffer, and just sets {{pos=2000}}.
{code}
readFully(10000);
{code}
This first copies the remaining 2096 bytes from the buffer and sets {{pos=4096}}. Then, because 7904 bytes are remaining, and this is larger than the buffer size, it copies them directly into the user-supplied output buffer. This leaves the state of the stream at {{pos=4096, count=4096, filepos=12000}}.
{code}
seek(11000);
{code}
The optimization in BufferedFSInputStream sees that there are 4096 buffered bytes, and that this seek is supposedly within the window, assuming that those 4096 bytes directly precede filepos. So, it erroneously just sets {{pos=3096}}. The next read will then get the wrong results for the first 1000 bytes -- yielding bytes 3096-4096 of the file instead of bytes 11000-12000.
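
The sequence described in this comment can be replayed with a small model of the buffer bookkeeping. This is a sketch under stated assumptions, not the Hadoop source: the class `SeekBugModel`, the field `bufStart`, and the helper `nextByteOffset` are hypothetical names introduced for illustration, the seek logic is reconstructed from the description above, and the file is assumed to be large enough that every read is satisfied in full. No real I/O is performed; only the counters are tracked.

```java
/** Toy model of the buffered-stream bookkeeping; no real I/O is performed. */
class SeekBugModel {
    static final int BUF_SIZE = 4096;

    int pos = 0;       // cursor inside the buffer
    int count = 0;     // number of valid bytes in the buffer
    long filePos = 0;  // position of the underlying stream
    long bufStart = 0; // model-only: true file offset of buffer[0]

    /** Seek with the unguarded in-buffer optimization, as described above. */
    void seek(long target) {
        long end = filePos;
        long start = end - count; // assumes the buffer directly precedes filePos
        if (target >= start && target < end) {
            pos = (int) (target - start);
            return;
        }
        // Otherwise invalidate the buffer and reposition the underlying stream.
        pos = 0;
        count = 0;
        filePos = target;
    }

    /** Consume len bytes, mirroring how a buffered stream serves a bulk read. */
    void read(int len) {
        int fromBuffer = Math.min(count - pos, len);
        pos += fromBuffer;
        len -= fromBuffer;
        if (len == 0) {
            return;
        }
        if (len >= BUF_SIZE) {
            // Large remainder: read straight from the underlying stream,
            // leaving the stale buffer contents and count untouched.
            filePos += len;
        } else {
            // Small remainder: refill the buffer, then serve from it.
            bufStart = filePos;
            filePos += BUF_SIZE;
            count = BUF_SIZE;
            pos = len;
        }
    }

    /** File offset of the byte the next read would actually return. */
    long nextByteOffset() {
        return pos < count ? bufStart + pos : filePos;
    }

    /** Replays the sequence from the comment and returns the offset served next. */
    static long replay() {
        SeekBugModel m = new SeekBugModel();
        m.seek(0);
        m.read(1);
        m.seek(2000);
        m.read(10000);
        m.seek(11000);
        return m.nextByteOffset();
    }

    public static void main(String[] args) {
        System.out.println("requested offset 11000, next byte comes from offset " + replay());
    }
}
```

In this model the final seek leaves the cursor at buffer index 3096 while the buffer still holds the first 4096 bytes of the file, which matches the wrong-data symptom described in the comment.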
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578275#comment-13578275 ]

Steve Loughran commented on HADOOP-9307:

Interesting. I saw some quirks with data reads/writes when talking to OpenStack Swift, but felt that was related to eventual consistency, not buffering.

If you look in {{FileSystemContractBaseTest}} there's some updated code for creating test datasets and comparing byte arrays in files. That comparison code could be teased out, and/or a new test added to the contract: if you seek(offset) then readFully(bytes[]), you get the data at file[offset]...file[offset+bytes.length-1].

Let me add that to my list of things we assume that a filesystem does.
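
The seek-then-readFully contract proposed in this comment could be checked along the following lines. This is only a sketch: it uses `java.io.RandomAccessFile` on a local temp file as a stand-in for a Hadoop input stream, and the class name `SeekReadContractCheck`, the mod-251 test pattern, and the round count are assumptions made for illustration rather than anything taken from {{FileSystemContractBaseTest}}.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Random;

class SeekReadContractCheck {
    /** Deterministic content: the byte at a given offset is offset mod 251. */
    static byte expectedByte(long offset) {
        return (byte) (offset % 251);
    }

    /** Performs random seek + readFully rounds and counts mismatching bytes. */
    static long countMismatches(RandomAccessFile in, int fileLen, int rounds, long seed)
            throws IOException {
        Random rnd = new Random(seed);
        long mismatches = 0;
        for (int r = 0; r < rounds; r++) {
            int len = 1 + rnd.nextInt(10000);
            int offset = rnd.nextInt(fileLen - len + 1);
            byte[] buf = new byte[len];
            in.seek(offset);
            in.readFully(buf);
            // Contract: buf[i] must equal file[offset + i] for every i.
            for (int i = 0; i < len; i++) {
                if (buf[i] != expectedByte((long) offset + i)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    /** Writes a 64 KiB temp file, runs the check, and returns the mismatch count. */
    static long run() {
        int fileLen = 64 * 1024;
        byte[] data = new byte[fileLen];
        for (int i = 0; i < fileLen; i++) {
            data[i] = expectedByte(i);
        }
        try {
            File f = File.createTempFile("seek-contract", ".bin");
            try {
                Files.write(f.toPath(), data);
                try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
                    return countMismatches(in, fileLen, 200, 42L);
                }
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatching bytes: " + run());
    }
}
```

Mixing read lengths above and below the buffer size is deliberate here, since the bug discussed in this issue depends on a large read bypassing the buffer before a later seek.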
[jira] [Commented] (HADOOP-9307) BufferedFSInputStream.read returns wrong results after certain seeks
[ https://issues.apache.org/jira/browse/HADOOP-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578514#comment-13578514 ]

Todd Lipcon commented on HADOOP-9307:

Yeah, I have a randomized test case that finds this bug within a few seconds -- basically a copy of one that I wrote for HDFS a couple of years ago. I will upload it with a bugfix patch, hopefully later today but maybe early next week (pretty busy the next two days).

FWIW the fix is simple -- just need to add {{(this.pos != this.count)}} to the condition for running the seek-in-buffer optimization.
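
The guard mentioned in this comment might look roughly like the following. It is a sketch of the decision logic only, not the committed patch: `GuardedSeek` and `seekInBuffer` are hypothetical names, and the stream state is passed in as plain arguments so the behavior can be checked without any real stream.

```java
/** Sketch of the in-buffer seek decision with the extra guard applied. */
class GuardedSeek {
    /**
     * Returns the new in-buffer cursor if the seek can be served from the
     * buffer, or -1 if the buffer must be invalidated and the underlying
     * stream repositioned.
     */
    static int seekInBuffer(long target, int pos, int count, long filePos) {
        // Guard from the comment: when pos == count the buffer is exhausted,
        // and later reads may have bypassed it, so it need not precede filePos.
        if (pos != count) {
            long end = filePos;
            long start = end - count;
            if (target >= start && target < end) {
                return (int) (target - start);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Exhausted buffer after a large read (pos=4096, count=4096, filepos=12000).
        System.out.println(seekInBuffer(11000, 4096, 4096, 12000));
        // Freshly primed buffer after a 1-byte read: the optimization still applies.
        System.out.println(seekInBuffer(2000, 1, 4096, 4096));
    }
}
```

With the guard in place, the seek to 11000 from the exhausted-buffer state falls through to the invalidate path instead of reusing stale buffer contents, while seeks into a partially consumed buffer keep the optimization.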