[jira] [Commented] (BLUR-18) Rework the MapReduce Library to implement Input/OutputFormats

Aaron McCurry (JIRA) Mon, 05 Nov 2012 18:36:14 -0800

    [ 
https://issues.apache.org/jira/browse/BLUR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491137#comment-13491137
 ]


Aaron McCurry commented on BLUR-18:
-----------------------------------

I don't think that the BlurDocLocation needs to be Comparable (implement Raw 
Comparators).  It's really there to let you know what document you are reading 
at that moment.  The reasoning:  When each InputSplit opens a session on a 
given shard server, the act of opening the session creates a temporary snapshot 
of the indexes.  This snapshot guarantees that the document ids will not change 
while the session is open.  So the document location is just the shard index 
for the given table plus the internal Lucene document id.  After the session is 
closed, or if another session is created before or after the InputSplit creates 
its session, the internal Lucene document ids may have changed.  This is due to 
near real-time updates and/or merges that have taken effect.

In any event, the document location is only valid while the session is open.  I 
just had a thought: should we take the 2 integers (shard index + Lucene 
internal document id) and make them a single long value?  Then we could just 
use LongWritable, and I could simplify the Thrift API to represent this as 
well.  What do you guys think?
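
To make the single-long idea concrete, here is a minimal sketch of one way the 
two integers could be packed.  The class and method names are hypothetical and 
not part of Blur, and putting the shard index in the high 32 bits is only one 
possible layout.

```java
// Hypothetical helper, not part of the Blur code base.
public final class DocLocationPacking {

    private DocLocationPacking() {
    }

    /** Packs the shard index into the high 32 bits and the doc id into the low 32 bits. */
    public static long pack(int shardIndex, int docId) {
        // Mask the doc id so a negative value cannot sign-extend into the shard bits.
        return ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL);
    }

    /** Recovers the shard index from the high 32 bits. */
    public static int shardIndex(long location) {
        return (int) (location >>> 32);
    }

    /** Recovers the Lucene document id from the low 32 bits. */
    public static int docId(long location) {
        return (int) location;
    }

    public static void main(String[] args) {
        long location = pack(3, 42);
        System.out.println(shardIndex(location) + " " + docId(location));
    }
}
```

The packed value could then be carried in a LongWritable.  Like the two-integer 
form, it is only meaningful while the session that produced it is open.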
                
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>
>                 Key: BLUR-18
>                 URL: https://issues.apache.org/jira/browse/BLUR-18
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>             Fix For: 0.2.0
>
>         Attachments: 0001-BLUR-ID-18-Created-New-Version-of-Files.patch, 
> 0001-BLUR-ID-18-New-Writables.patch
>
>
> Currently the only way to implement indexing is to use the BlurReducer.  A 
> better way to implement this would be to support Hadoop Input/OutputFormats 
> in both the new and old APIs.  This would allow easier integration with 
> other Hadoop projects such as Hive and Pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
