[ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799880#comment-13799880 ]

Ravikumar commented on BLUR-220:
--------------------------------

Thanks for the link. Now I understand how this is all related.

I was thinking of another idea that I wanted to get your opinion on.

Now that we have a SortingMergePolicy from Lucene, it is actually possible to 
co-locate all records of a given row. This works during a segment merge, but 
newly added records of a row will still be scattered and will take their own 
time to participate in a merge.
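For reference, wiring that up is just a merge-policy wrap. A minimal sketch, 
assuming a string "rowid" field to sort on and the Lucene 4.6-era API where 
SortingMergePolicy (from lucene-misc) takes a Sort; the field name and class 
name here are illustrative, not Blur code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class RowSortedWriterFactory {
      public static IndexWriter open(Directory dir) throws Exception {
        // Sort merged segments by row id so that all records of a given
        // row come out adjacent after every merge.
        Sort byRow = new Sort(new SortField("rowid", SortField.Type.STRING));
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46,
            new StandardAnalyzer(Version.LUCENE_46));
        conf.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), byRow));
        return new IndexWriter(dir, conf);
      }
    }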

Instead, for an online indexing case where we have records continuously 
trickling in for all rows, would it be good to do something like Zoie's search 
system, where incoming operations buffer directly to RAM rather than to disk? 
Since we already have a transaction log, recovery is built in for Blur. The 
details are at https://code.google.com/p/zoie/wiki/ZoieSystem

The basic idea here is to divide the allocated RAM into RAM-A and RAM-B. All 
document operations go into RAM-A. When RAM-A is full, swap RAM-A and RAM-B. A 
custom searcher wraps both the RAM directories and the disk-based directories 
to return the final set of results. This is almost like the HBase Memstore, 
except that we have two slots of memory.
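Roughly like this hypothetical sketch of the two-slot scheme (not Blur code; 
class and method names are made up, and opening a reader assumes each 
directory has been committed to at least once, since DirectoryReader.open 
fails on an empty directory):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    class DualRamBuffer {
      private RAMDirectory active = new RAMDirectory();   // RAM-A: takes writes
      private RAMDirectory retiring = new RAMDirectory(); // RAM-B: being flushed

      // Swap slots when the active buffer fills; the returned slot gets
      // flushed to disk while the other keeps accepting operations.
      synchronized RAMDirectory swap() {
        RAMDirectory full = active;
        active = retiring;
        retiring = full;
        return full;
      }

      // Searcher view over both RAM slots plus the on-disk index.
      IndexReader openReader(Directory disk) throws Exception {
        return new MultiReader(
            DirectoryReader.open(active),
            DirectoryReader.open(retiring),
            DirectoryReader.open(disk));
      }
    }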

Instead of blindly flushing the full RAM to disk, we apply our 
SortingMergePolicy's sort to this RAM and then flush to disk. With this 
approach, even fresh segments will have all records of a row co-located.
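A sketch of that sorted flush, again hypothetical and assuming the same 
"rowid" field; SortingAtomicReader is the Lucene 4.6-era class from 
lucene-misc, and the wrapped view feeds addIndexes so the new on-disk segment 
is written in row order:

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.index.sorter.SortingAtomicReader;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.store.RAMDirectory;

    class SortedFlush {
      static void flush(RAMDirectory retired, IndexWriter diskWriter)
          throws Exception {
        Sort byRow = new Sort(new SortField("rowid", SortField.Type.STRING));
        AtomicReader ram =
            SlowCompositeReaderWrapper.wrap(DirectoryReader.open(retired));
        // Present the RAM docs in row order, so the fresh on-disk segment
        // is born with all records of each row co-located.
        diskWriter.addIndexes(SortingAtomicReader.wrap(ram, byRow));
        diskWriter.commit();
      }
    }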

All of Blur's existing functionality should then work unchanged, right?

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, 
> CreateIndex.java, CreateSortedIndex.java, FullRowReindexing.java, 
> MyEarlyTerminatingCollector.java, test_results.txt, TestSearch.java, 
> TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the 
> number of Records.  Updates in Lucene are performed by deleting the document 
> and re-adding it to the index.  Unfortunately, when any update is performed 
> on a Row in Blur (if the RowMutationType is UPDATE_ROW), the entire Row has 
> to be re-read, whatever modifications are needed are made, and then it is 
> reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a 
> given Row.  It may vary based on the kind of hardware being used, but as a 
> Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this 
> problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
