[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749188#action_12749188
 ] 

Ashutosh Chauhan commented on PIG-934:
--------------------------------------

>> Seeking to an offset would only work for a single file - hence maybe have a 
>> separate function...

Since open() returns an input stream, it is not hard to conceive of a use case 
where one would want to seek into that stream even when the filespec points to 
a directory or a glob. We have to define the semantics here: what does seeking 
in a directory/glob mean? One reasonable answer is to view all the files in the 
directory/glob as one big logical file, treat the offset as an offset into this 
logical file, and seek to it. Something along the lines of:
{code}
iterator = DataStreamIterator
bytesSeen = 0;
while (iterator.hasNext()) {
  open current file pointed to by iterator
  fileStart = bytesSeen   // logical offset at which the current file begins
  bytesSeen += current file length
  if (bytesSeen > offset)
    bind to (offset - fileStart) in current file and return
  else
    continue;             // offset lies in a later file
}
{code} 
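
To make the "one big logical file" idea concrete, here is a minimal Java sketch of just the offset arithmetic. It is not part of the patch: the class LogicalOffset, its nested Position type, and the resolve() method are hypothetical names, and the file lengths are assumed to be already known and listed in iteration order.

```java
import java.util.Arrays;
import java.util.List;

public class LogicalOffset {
    /** Hypothetical result type: which file holds the offset, and where inside it. */
    public static final class Position {
        public final int fileIndex;
        public final long offsetInFile;

        Position(int fileIndex, long offsetInFile) {
            this.fileIndex = fileIndex;
            this.offsetInFile = offsetInFile;
        }
    }

    /**
     * Treats the files as one concatenated logical file and maps a logical
     * offset to (file index, offset within that file).
     */
    public static Position resolve(List<Long> fileLengths, long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("negative offset: " + offset);
        }
        long bytesSeen = 0; // total length of the files before the current one
        for (int i = 0; i < fileLengths.size(); i++) {
            long length = fileLengths.get(i);
            if (offset < bytesSeen + length) {
                // The offset falls inside file i; adjust it to be file-relative.
                return new Position(i, offset - bytesSeen);
            }
            bytesSeen += length;
        }
        throw new IllegalArgumentException(
            "offset " + offset + " is past the end of the input (" + bytesSeen + " bytes)");
    }

    public static void main(String[] args) {
        Position p = resolve(Arrays.asList(10L, 20L, 30L), 15L);
        System.out.println("file " + p.fileIndex + ", offset " + p.offsetInFile);
    }
}
```

With file lengths 10, 20 and 30, a logical offset of 15 lands in the second file at file-relative offset 5. Empty files are skipped naturally, and an offset at or beyond the total length is rejected rather than silently clamped; whether that is the right behaviour is one of the semantics that would have to be decided.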

But since there is no such requirement currently, we can detect the case where 
a seek is requested on a directory/glob and throw an exception (as is done in 
this patch). Later on, if we decide to support it instead of throwing an 
exception, we can implement whatever semantics we decide on. If we create a new 
function with a separate name, it will be confusing to make these changes later 
on. Moreover, if there is a different function, the user of the API needs to 
"know" about it and deal with it (e.g., the need for a special constructor in 
POLoad). I think the presence or absence of an offset parameter in the argument 
list is a sufficient indicator of which overloaded version of open() to call 
when a seek is needed. 
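
As a rough illustration of the overloading argument, the sketch below pairs a plain open() with an open() that takes an offset and rejects directories and globs. It is not the code in the patch: it uses java.nio.file instead of FileLocalizer and PigContext, the class name SeekableOpen and the helpers isGlob() and skipFully() are hypothetical, and the no-offset overload handles only a single plain file for brevity.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SeekableOpen {
    /** Stand-in for the existing open(): here it only opens a single plain file. */
    public static InputStream open(String fileSpec) throws IOException {
        return Files.newInputStream(Paths.get(fileSpec));
    }

    /** Overload with an offset: same name, but seeking a directory/glob is an error. */
    public static InputStream open(String fileSpec, long offset) throws IOException {
        if (offset < 0) {
            throw new IOException("negative offset: " + offset);
        }
        // Check for glob characters first so the spec is never parsed as a path.
        if (isGlob(fileSpec) || Files.isDirectory(Paths.get(fileSpec))) {
            throw new IOException(
                "seeking to an offset is not supported for a directory or glob: " + fileSpec);
        }
        InputStream in = open(fileSpec);
        try {
            skipFully(in, offset);
            return in;
        } catch (IOException e) {
            in.close(); // do not leak the stream if the seek fails
            throw e;
        }
    }

    // Assumption: these characters are enough to recognise a glob in this sketch.
    private static boolean isGlob(String fileSpec) {
        return fileSpec.matches(".*[*?\\[{].*");
    }

    // InputStream.skip() may skip fewer bytes than requested, so loop until done.
    private static void skipFully(InputStream in, long offset) throws IOException {
        long remaining = offset;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() < 0) {
                throw new EOFException("offset " + offset + " is past the end of the stream");
            } else {
                remaining--; // read() consumed exactly one byte
            }
        }
    }
}
```

A caller that needs a seek just passes the offset; one that does not keeps calling the one-argument version. If directory/glob seeking is supported later, only the body of the second overload would change, not its callers.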
Thoughts?

> Merge join implementation currently does not seek to right point on the right 
> side input based on the offset provided by the index
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-934
>                 URL: https://issues.apache.org/jira/browse/PIG-934
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.1
>            Reporter: Pradeep Kamath
>            Assignee: Ashutosh Chauhan
>         Attachments: pig-934.patch
>
>
> We use POLoad to seek into right file which has the following code: 
> {noformat}
>     public void setUp() throws IOException{
>         String filename = lFile.getFileName();
>         loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
>         is = FileLocalizer.open(filename, pc);
>         loader.bindTo(filename , new BufferedPositionedInputStream(is), this.offset, Long.MAX_VALUE);
>     }
> {noformat}
> Between opening the stream and bindTo we do not seek to the right offset. 
> bindTo itself does not perform any seek.
