[jira] [Commented] (BLUR-220) Support for humongous Rows

Ravikumar (JIRA) Thu, 17 Oct 2013 00:42:19 -0700

    [ 
https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797690#comment-13797690
 ]


Ravikumar commented on BLUR-220:
--------------------------------

I have two basic questions.

Row-Query:

Typically, in the NoSQL world, a row query is always by rowId. But I gather 
from this link 
[http://incubator.apache.org/blur/docs/0.2.0/data-model.html#row_query] that a 
row query in Blur actually means a query across rowIds. 

In our system, we never query anything without the rowId, as rowId=userId. In 
rare cases a query may contain multiple rowIds, but there is never a query 
without one. That is why all the queries in the test cases I submitted have a 
rowId (the "id" field), whereas your test cases do not. Is my understanding 
correct?

For a system like mine, it should still be fine to scatter documents across 
segments, as rowId filter caches will be readily available and the rest is 
left to Lucene. Online indexing is so heavy that re-indexing even once is a 
major exercise for us. The current approach of continuous re-indexing is 
definitely not viable, at least for us.
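
The idea can be sketched with a toy in-memory model. This is not Blur or 
Lucene code: the segment layout, the build_rowid_filters and search_row 
helpers, and the sample data are all hypothetical. It only shows that records 
of one rowId can be spread over several segments and still be found by 
consulting a cached per-rowId document set first.

```python
from collections import defaultdict

# Toy index: each segment is a list of documents; a document is a dict.
# Records of the same row ("u1") are deliberately scattered across segments.
segments = [
    [{"id": "u1", "body": "hello world"}, {"id": "u2", "body": "hello blur"}],
    [{"id": "u1", "body": "lucene rocks"}, {"id": "u3", "body": "hello again"}],
    [{"id": "u1", "body": "hello lucene"}],
]

def build_rowid_filters(segments):
    """Map rowId -> set of (segment, doc) positions; stands in for a filter cache."""
    filters = defaultdict(set)
    for seg_no, seg in enumerate(segments):
        for doc_no, doc in enumerate(seg):
            filters[doc["id"]].add((seg_no, doc_no))
    return filters

def search_row(segments, filters, row_id, term):
    """Return positions of docs in row `row_id` whose body contains `term`."""
    hits = []
    # Only documents allowed by the rowId filter are examined at all.
    for seg_no, doc_no in sorted(filters.get(row_id, ())):
        if term in segments[seg_no][doc_no]["body"].split():
            hits.append((seg_no, doc_no))
    return hits

filters = build_rowid_filters(segments)
print(search_row(segments, filters, "u1", "hello"))
```

In this model, adding a record to a row only appends one document to some 
segment and updates the filter; nothing already indexed is rewritten.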

"Today when add/update of a row happens all the records are indexes against the 
indexwriter as a collection of documents so that they are guaranteed to be 
back-to-back. Currently this is required for the Row Query"

-- Can you point me to the code where row queries depend on this back-to-back 
layout, or is it purely a matter of performance?
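
A toy model may help frame the question. It assumes, purely for illustration, 
that the records of a row are stored contiguously and that the first record 
of each row carries a marker; the "row_start" field and the rows_matching 
function are hypothetical and are not taken from the Blur source. Under that 
assumption, record-level hits can be grouped into rows using document order 
alone, without looking up any rowId.

```python
# Documents in index order. Records of one row are contiguous, and the first
# record of each row carries a marker (hypothetical name: "row_start").
docs = [
    {"row_start": True,  "body": "hello world"},   # row starting at 0
    {"row_start": False, "body": "lucene rocks"},  # same row
    {"row_start": True,  "body": "hello blur"},    # row starting at 2
    {"row_start": True,  "body": "nothing here"},  # row starting at 3
    {"row_start": False, "body": "hello again"},   # same row
]

def rows_matching(docs, term):
    """Return the start position of every row with at least one matching record."""
    matched = []
    current_start = None
    for pos, doc in enumerate(docs):
        if doc["row_start"]:
            current_start = pos  # a new row begins here
        is_hit = term in doc["body"].split()
        # Record each row once, even if several of its records match.
        if is_hit and (not matched or matched[-1] != current_start):
            matched.append(current_start)
    return matched

print(rows_matching(docs, "hello"))
```

If the dependency is of this kind, scattering records would break the 
grouping itself; if it is not, contiguity would only affect speed.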

Apologies for my persistent questions. I am a complete newbie and am just 
getting started with Blur. 

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, 
> CreateIndex.java, CreateSortedIndex.java, MyEarlyTerminatingCollector.java, 
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically 
> the number of Records.  Updates in Lucene are currently performed by 
> deleting the document and re-adding it to the index.  Unfortunately, when any 
> update is performed on a Row in Blur, the entire Row has to be re-read (if 
> the RowMutationType is UPDATE_ROW), whatever modifications are needed are 
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a 
> given Row.  The limit may vary based on the kind of hardware being used, but 
> as a Row grows in size, indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this 
> problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
