Pradeep Kamath commented on PIG-930:

I spoke to Ben (who wrote the bzip2 code). The position returned from 
getPosition() starts off as an offset into the compressed bz2 file and then 
becomes a counter on the uncompressed stream. It is therefore inaccurate: it 
is a position on neither the compressed nor the uncompressed stream, but a 
best-effort in-between value. Also, with compression a single compressed byte 
can correspond to multiple uncompressed bytes, or vice versa. So getting a 
position on the data that is accurate enough to fetch the very next tuple 
would be difficult.

Besides this, I think the fact that we do bindTo(pos > 0 ? pos - 1 : pos) (we 
use pos - 1 because bindTo will discard the first tuple for pos > 0) is not 
very clean. We cannot always assume that 1 byte before the position suggested 
by the index is the right position to bindTo so that we correctly reach the 
tuple in the index. (For example, if the delimiter is multi-byte, the loader 
may discard the tuple we want to get to!) Approach 2) outlined above avoids 
this hack, since we bind to startOfDfsBlock and then call getPosition() and 
getNext() repeatedly until we reach the position suggested in the index. The 
next getNext() should then give us the exact same key as in the index, since the 
index creation code follows the same sequence of bindTo()-> getPosition() -> 
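
To make the replay idea in approach 2) concrete, here is a minimal sketch in 
Java. It is only an illustration under stated assumptions, not Pig's actual 
code: the Loader interface, the FakeLoader class and the seekToIndexed method 
are hypothetical names invented for this example, and tuples are plain 
strings. The sketch also assumes that a loader bound to the same block start 
reports the same sequence of positions every time, even if those positions 
are not true byte offsets.

```java
import java.util.Arrays;
import java.util.List;

class SeekByScan {
    /** Hypothetical stand-in for the three loader calls named in the comment. */
    interface Loader {
        void bindTo(long offset); // position the loader at a block start
        long getPosition();       // loader-reported position, treated as an opaque token
        String getNext();         // next tuple, or null at end of input
    }

    /**
     * Bind at the block start and replay getPosition()/getNext() until the
     * loader reports the position stored in the index; the following
     * getNext() is the indexed tuple. Returns null if the position never
     * shows up before the input ends.
     */
    static String seekToIndexed(Loader loader, long startOfDfsBlock, long indexedPos) {
        loader.bindTo(startOfDfsBlock);
        while (loader.getPosition() != indexedPos) {
            if (loader.getNext() == null) {
                return null; // ran out of tuples without meeting the indexed position
            }
        }
        return loader.getNext();
    }

    /** Toy in-memory loader whose positions are arbitrary but reproducible. */
    static final class FakeLoader implements Loader {
        private final List<String> tuples;
        private final long[] positions; // position reported before each tuple
        private int next;

        FakeLoader(List<String> tuples, long[] positions) {
            this.tuples = tuples;
            this.positions = positions;
        }

        @Override public void bindTo(long offset) { next = 0; } // toy: a single block
        @Override public long getPosition() { return next < positions.length ? positions[next] : -1L; }
        @Override public String getNext() { return next < tuples.size() ? tuples.get(next++) : null; }
    }

    public static void main(String[] args) {
        // Positions 0, 40, 41 are deliberately not byte-accurate.
        Loader loader = new FakeLoader(Arrays.asList("a", "b", "c"), new long[] {0L, 40L, 41L});
        System.out.println(seekToIndexed(loader, 0L, 41L)); // prints "c"
    }
}
```

The point of the sketch is that the position is only ever compared for 
equality against a value produced by the same call sequence, so it does not 
need to be an exact byte offset and no pos - 1 adjustment is required. The 
cost is that the join side re-reads tuples from the block start up to the 
indexed position.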

> merge join should handle compressed bz2 sorted files
> ----------------------------------------------------
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
> There are two issues. First, POLoad, which is used to read the right-side 
> input, does not handle bz2 files right now. This needs to be fixed.
> Further, in the index map job we bindTo(startOfBlockOffSet) (this will 
> internally discard the first tuple if offset > 0). Then we do the following:
> {noformat}
> do {
>   pos = getPosition()
>   tuple = getNext()
>   run the tuple through the right-side pipeline, which could contain a filter
> } while (the tuple does not survive the pipeline)
> emit(key, pos, filename)
> {noformat}
> Then, in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) 
> (we use pos - 1 because bindTo will discard the first tuple for pos > 0). 
> Then we call getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not 
> really accurate. The problem is it could be a position in the middle of a 
> compressed bz2 block. Then when we use that position to bindTo() in the final 
> map job, the code would first hunt for a bz2 block header thus skipping the 
> whole current bz2 block. 
