[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749901#action_12749901 ]

Ashutosh Chauhan commented on PIG-934:
--------------------------------------

All tests passed on my local box. Not sure why they failed on hudson.

> Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
> ----------------------------------------------------------------------
>
>                 Key: PIG-934
>                 URL: https://issues.apache.org/jira/browse/PIG-934
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.1
>            Reporter: Pradeep Kamath
>            Assignee: Ashutosh Chauhan
>         Attachments: pig-934_2.patch
>
>
> We use POLoad to seek into right file which has the following code:
> {noformat}
> public void setUp() throws IOException{
>     String filename = lFile.getFileName();
>     loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>     is = FileLocalizer.open(filename, pc);
>     loader.bindTo(filename , new BufferedPositionedInputStream(is), this.offset, Long.MAX_VALUE);
> }
> {noformat}
> Between opening the stream and bindTo we do not seek to the right offset.
> bindTo itself does not perform any seek.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749853#action_12749853 ]

Hadoop QA commented on PIG-934:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12418219/pig-934_2.patch
  against trunk revision 806668.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/5/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/5/console

This message is automatically generated.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749822#action_12749822 ]

Giridharan Kesavan commented on PIG-934:
----------------------------------------

I resubmitted the patch to hudson as the core tests failed for not finding javac.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749806#action_12749806 ]

Hadoop QA commented on PIG-934:
-------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12418219/pig-934_2.patch
  against trunk revision 806668.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/console

This message is automatically generated.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749669#action_12749669 ]

Pradeep Kamath commented on PIG-934:
------------------------------------

Agree with both the above comments. I was wondering whether, instead of returning an InputStream, the code could return a SeekableInputStream, which would be usable in other scenarios (like creating a CBZip2InputStream out of it; this would be needed for http://issues.apache.org/jira/browse/PIG-930, for example). Callers only needing an InputStream would still be able to use the method.
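The point about callers that only need an InputStream follows from subtyping: a seekable stream is itself an InputStream, so widening the return type of the opener costs existing callers nothing. A small sketch of that idea, using hypothetical stand-in classes (SeekableStream, ByteArraySeekableStream and Opener are illustrative names, not Pig's SeekableInputStream or FileLocalizer):

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a seekable stream type; not Pig's class. */
abstract class SeekableStream extends InputStream {
    /** Moves the read position to an absolute offset from the start. */
    abstract void seek(long position) throws IOException;
}

/** Hypothetical in-memory implementation, used only for illustration. */
final class ByteArraySeekableStream extends SeekableStream {
    private final byte[] data;
    private int position;

    ByteArraySeekableStream(byte[] data) {
        this.data = data.clone();
    }

    @Override
    void seek(long newPosition) throws IOException {
        if (newPosition < 0 || newPosition > data.length) {
            throw new IOException("seek out of range: " + newPosition);
        }
        position = (int) newPosition;
    }

    @Override
    public int read() {
        // Return the next byte as an unsigned value, or -1 at end of data.
        return position < data.length ? (data[position++] & 0xff) : -1;
    }
}

/** Hypothetical opener that returns the more specific, seekable type. */
final class Opener {
    static SeekableStream open(byte[] contents) {
        return new ByteArraySeekableStream(contents);
    }
}
```

A caller that needs to seek keeps the SeekableStream reference; a caller that does not can assign the result straight to an InputStream variable and use it unchanged.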
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749196#action_12749196 ]

Dmitriy V. Ryaboy commented on PIG-934:
---------------------------------------

Throwing an exception when a seek is past the file boundary seems acceptable to me (and preferable to adding new functions and changing upstream code that shouldn't care about this detail). Especially since if there is a way to get a consistent ordering among files in a directory, it's trivial to later update this code to seek past file boundaries and into the next file.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749188#action_12749188 ]

Ashutosh Chauhan commented on PIG-934:
--------------------------------------

>> Seeking to an offset would only work for a single file - hence maybe have a separate function...

Since open() returns an input stream, it is not hard to conceive of a use case where one would want to seek into that stream even when the filespec points to a directory or a glob. We have to define the semantics here. What does seeking in a directory/glob mean? One reasonable answer is to view all the files in the directory/glob as one big logical file, treat the offset as an offset into this logical file, and then seek into this file. Something along the lines of:
{code}
iterator = DataStreamIterator
bytesSeen = 0;
while (iterator.hasNext()) {
    open current file pointed by iterator
    bytesSeen += current file length
    if (bytesSeen > offset)
        bind to adjusted offset in current file and return
    else
        continue;
}
{code}

But since there is currently no requirement for this, we can catch the situation where a seek is requested on a directory/glob and throw an exception (as is done in this patch). Later on, if we decide to support it instead of throwing an exception, we can implement whatever semantics we decide on. If we create a new function with a separate name, it will be confusing to make these changes later on. Moreover, if there is a different function, the user of the API needs to "know" about it and deal with it (e.g., the need for a special constructor in POLoad). The presence or absence of the offset parameter in the argument list is, I think, a sufficient indicator of which version of the overloaded open() to call when a seek is needed.

Thoughts?
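The "one big logical file" semantics can be made concrete with a small, self-contained sketch. It assumes the lengths of the files are already known and come in a consistent order; LogicalOffset and locate are hypothetical names, not part of Pig. The "adjusted offset" in the pseudocode is the logical offset minus the total length of all preceding files.

```java
import java.util.List;

/**
 * Hypothetical helper, not part of Pig: maps an offset into the
 * concatenation of several files to (file index, offset within that file).
 */
final class LogicalOffset {
    final int fileIndex;
    final long offsetInFile;

    LogicalOffset(int fileIndex, long offsetInFile) {
        this.fileIndex = fileIndex;
        this.offsetInFile = offsetInFile;
    }

    static LogicalOffset locate(List<Long> fileLengths, long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("negative offset: " + offset);
        }
        long bytesBefore = 0; // total length of the files preceding file i
        for (int i = 0; i < fileLengths.size(); i++) {
            long length = fileLengths.get(i);
            if (offset < bytesBefore + length) {
                // The offset falls inside file i; subtract what came before.
                return new LogicalOffset(i, offset - bytesBefore);
            }
            bytesBefore += length;
        }
        throw new IllegalArgumentException(
                "offset " + offset + " is past the end of the input ("
                + bytesBefore + " bytes)");
    }
}
```

With file lengths of 100, 50 and 200 bytes, a logical offset of 120 resolves to the second file (index 1) at offset 20. Empty files are skipped naturally because no offset can fall inside them.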
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749042#action_12749042 ]

Pradeep Kamath commented on PIG-934:
------------------------------------

I thought a separate function with "singleFile" in the name was needed because the current FileLocalizer.open() can handle directories and hence returns a DataStorageInputStreamIterator, which internally iterates over the underlying streams of the files in the directory. Keeping the same name may give the impression that the same capability is present even in the version which seeks to an offset. Seeking to an offset would only work for a single file, so a separate function whose name implies this restriction might be cleaner.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749036#action_12749036 ]

Ashutosh Chauhan commented on PIG-934:
--------------------------------------

Also, this doesn't warrant a new constructor in POLoad for seeking.
[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index
    [ https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748070#action_12748070 ]

Pradeep Kamath commented on PIG-934:
------------------------------------

To get an idea of how this seeking is done for regular loads in map tasks, I looked at PigSlice.java; the seek happens in the init() code before bindTo():
{code}
public void init(DataStorage base) throws IOException {
    ..
    fsis = base.asElement(base.getActiveContainer(), file).sopen();
    fsis.seek(start, FLAGS.SEEK_CUR);
    end = start + getLength();

    if (file.endsWith(".bz") || file.endsWith(".bz2")) {
        is = new CBZip2InputStream(fsis, 9);
    } else if (file.endsWith(".gz")) {
        is = new GZIPInputStream(fsis);
        // We can't tell how much of the underlying stream GZIPInputStream
        // has actually consumed
        end = Long.MAX_VALUE;
    } else {
        is = fsis;
    }
    loader.bindTo(file.toString(), new BufferedPositionedInputStream(is, start), start, end);
}
{code}

I think we need a FileLocalizer.sOpenSingleFile() method which can return a SeekableInputStream, and we can use that in setUp() in POLoad. Something along the lines of:
{code}
static public InputStream open(String fileSpec, PigContext pigContext) throws IOException {
    fileSpec = checkDefaultPrefix(pigContext.getExecType(), fileSpec);
    if (!fileSpec.startsWith(LOCAL_PREFIX)) {
        init(pigContext);
        ElementDescriptor elem = pigContext.getDfs().asElement(fullPath(fileSpec, pigContext));
        return elem.sopen();
    }
    else {
        fileSpec = fileSpec.substring(LOCAL_PREFIX.length());
        //buffering because we only want buffered streams to be passed to load functions.
        /*return new BufferedInputStream(new FileInputStream(fileSpec));*/
        init(pigContext);
        ElementDescriptor elem = pigContext.getLfs().asElement(fullPath(fileSpec, pigContext));
        return elem.sopen();
    }
}
{code}

The above code would only work with single files and not with directories, which should be OK for merge join. We should probably set this up with a new constructor in POLoad which also indicates that a single file is being processed.
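The gap reported in this issue (the stream is opened but never advanced to the index offset before bindTo) can be illustrated with plain java.io. This is a minimal sketch only: StreamSeek and skipFully are hypothetical names, and the code does not use Pig's DataStorage or SeekableInputStream classes. It advances an ordinary InputStream by a given number of bytes and fails loudly if the stream is shorter than the requested offset.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Pig. */
final class StreamSeek {
    /**
     * Advances the stream by exactly offset bytes, so that the next read
     * returns the byte at that offset. Throws EOFException if the stream
     * ends first.
     */
    static void skipFully(InputStream in, long offset) throws IOException {
        if (offset < 0) {
            throw new IllegalArgumentException("negative offset: " + offset);
        }
        long remaining = offset;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() may return 0 without being at end of stream, so read
            // a single byte to distinguish "no progress" from "no more data".
            if (in.read() < 0) {
                throw new EOFException(
                        "stream ended " + remaining + " bytes before the requested offset");
            }
            remaining--;
        }
    }
}
```

In terms of the setUp() method quoted in the issue, a call like this would sit between opening the stream and the bindTo call. A stream type with a real seek operation, as proposed above, avoids reading through the skipped bytes.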