[
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804394#comment-13804394
]
Ravikumar commented on BLUR-220:
--------------------------------
Thanks for the pointers on the block-cache. I will look into integrating it.
Maybe the on-heap allocator can use this logic, while the off-heap allocator
continues with the existing code.
I will create a new issue for the NRT updates.
Let's walk through the sequence:
1. Tiny sorted segments make it to disk from RAM.
2. Future merges take place among already-sorted segments.
3. So, inside every segment, each Row will be co-located with all of its
Records. But these Rows will still be scattered across segments.
4. The SortingMergePolicy implementation uses TimSort underneath, which means
it is almost O(n) for already-sorted data. Also, this is quite different from a
linear scan, as merges always try to bulk fetch-and-write data. For actual
comparisons, please look at the details at
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13605896&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13605896
As per that link, big segment merges are actually quite fast and on par with
normal merges, provided the index uses no stored fields. Otherwise, merges will
be 2-3X slower.
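To make the setup concrete, here is a minimal sketch of wiring a
SortingMergePolicy into an IndexWriter. It assumes the Lucene 4.x
SortingMergePolicy from lucene-misc (org.apache.lucene.index.sorter), and the
"rowid" sort field is just a hypothetical name for illustration, not Blur's
actual schema:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class SortedSegmentWriter {
  public static IndexWriter open(Directory dir) throws Exception {
    // Sort each merged segment by the Row id, so all Records of a Row
    // end up co-located inside every segment (step 3 above).
    // "rowid" is a hypothetical field name, not Blur's actual schema.
    Sort rowSort = new Sort(new SortField("rowid", SortField.Type.STRING));

    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));

    // Wrap the regular merge policy: merges still bulk fetch-and-write,
    // and TimSort is near O(n) when the input segments are already sorted.
    conf.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), rowSort));

    return new IndexWriter(dir, conf);
  }
}
{code}

Note that the merge policy only guarantees sorted output for merges; step 1
(tiny segments arriving on disk already sorted) still has to be handled at
flush time.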
Let me know if you are convinced by this.
> Support for humongous Rows
> --------------------------
>
> Key: BLUR-220
> URL: https://issues.apache.org/jira/browse/BLUR-220
> Project: Apache Blur
> Issue Type: Improvement
> Components: Blur
> Affects Versions: 0.3.0
> Reporter: Aaron McCurry
> Fix For: 0.3.0
>
> Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java,
> CreateIndex.java, CreateSortedIndex.java, FullRowReindexing.java,
> MyEarlyTerminatingCollector.java, SlabAllocator.java, SlabRAMDirectory.java,
> SlabRAMFile.java, SlabRAMInputStream.java, SlabRAMOutputStream.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the
> number of Records. Currently, updates in Lucene are performed by deleting
> the document and re-adding it to the index. Unfortunately, when any update
> is performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), then whatever modifications are needed are
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a
> given Row. It may vary based on the kind of hardware being used, but as a
> Row grows in size, indexing (mutations) against that Row will slow down.
> This issue is being created to discuss techniques for dealing with this
> problem.
--
This message was sent by Atlassian JIRA
(v6.1#6144)