[ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797465#comment-13797465 ]

Aaron McCurry commented on BLUR-220:
------------------------------------

Today when an add/update of a Row happens, all of the Records are indexed 
against the IndexWriter as a single collection of Documents so that they are 
guaranteed to be back-to-back.  Currently this is required for the Row Query ( 
http://incubator.apache.org/blur/docs/0.2.0/data-model.html#row_query ) to work 
properly.  Because of this requirement, as the Row increases in size the 
entire Row has to be re-indexed over and over again.  This means that writes 
take a huge performance hit when you are doing anything other than replacing 
the whole Row.
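
To make the mechanics concrete, here is a rough sketch of that indexing path 
against the Lucene 4.x API (the "rowid"/"recordid" field names are 
illustrative, not Blur's actual schema, and Record stands in for Blur's 
Thrift record type with hypothetical accessors):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Re-indexing a Row: every Record becomes a Document, and the whole block
// is written in one call so the Documents are stored back-to-back.
static void reindexRow(IndexWriter writer, String rowId, List<Record> records)
    throws IOException {
  List<Document> docs = new ArrayList<Document>();
  for (Record record : records) {
    Document doc = new Document();
    doc.add(new StringField("rowid", rowId, Field.Store.YES));
    doc.add(new StringField("recordid", record.getRecordId(), Field.Store.YES));
    // ... remaining columns of the Record ...
    docs.add(doc);
  }
  // updateDocuments() deletes the previous block and adds the new one as a
  // single contiguous run -- which is why the whole Row must be re-read and
  // re-indexed on every update, no matter how small the change.
  writer.updateDocuments(new Term("rowid", rowId), docs);
}
{code}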

Now I think it's possible that we could come up with a mixed approach where we 
use the query-time join for recently updated Rows and then merge them fully 
(somehow) back into the segment as back-to-back Documents, without reindexing 
the entire Row again.
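
As a sketch of the query-time half of that idea, Lucene's join module can 
already relate Records to Rows at search time with no block layout at all, 
which is what would let a freshly updated Row be searchable without a full 
re-index (the joinable "rowid" field is an assumption; this is not existing 
Blur code):

{code:java}
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

// Query-time join: find the rowids of the matching Records, then return
// every Document carrying one of those rowids.  No contiguous layout is
// required, but the join work is paid on every query instead of once at
// index time.
static Query recentRowJoin(IndexSearcher searcher, Query recordQuery)
    throws IOException {
  return JoinUtil.createJoinQuery(
      "rowid",        // fromField on the matching Records
      false,          // each Record carries a single rowid value
      "rowid",        // toField identifying the Row's Documents
      recordQuery,    // the per-Record query
      searcher,       // searcher over the same index
      ScoreMode.None);
}
{code}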

The reason the complexity exists today is that the query-time join (Row 
Query), when the Documents (Records) are indexed together, has negligible 
cost regardless of the size of the index.  Take the worst-case scenario for 
a pure query-time join: the same logical query run as a Row Query will take 
a few milliseconds instead of several seconds.
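
For comparison, the index-time side of this trade-off can be sketched with 
Lucene's 4.x block-join API, which only has to walk the contiguous block of 
Documents for each matching Row (the "prime" parent-marker field is 
illustrative, not Blur's actual implementation):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.FixedBitSetCachingWrapperFilter;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// The parents filter marks the parent Document of each block (by Lucene's
// convention, the last Document added).  It is cached as a bitset, so
// mapping a matching Record to its Row is a bitset hop, not a term join.
static Query rowQuery(Query recordQuery) {
  Filter parentsFilter = new FixedBitSetCachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("prime", "true"))));
  return new ToParentBlockJoinQuery(recordQuery, parentsFilter, ScoreMode.Avg);
}
{code}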

Aaron

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, 
> CreateIndex.java, CreateSortedIndex.java, MyEarlyTerminatingCollector.java, 
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the 
> number of Records.  Updates in Lucene are performed by deleting the document 
> and re-adding it to the index.  Unfortunately, when any update is performed 
> on a Row in Blur, the entire Row has to be re-read (if the RowMutationType 
> is UPDATE_ROW), whatever modifications are needed are made, and then it is 
> reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a 
> given Row.  It may vary based on the kind of hardware being used, but as the 
> Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this 
> problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
