[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements

Mikhail Khludnev (Issue Comment Edited) (JIRA) Sun, 27 Nov 2011 11:48:07 -0800

    [ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157979#comment-13157979 ]


Mikhail Khludnev edited comment on SOLR-2382 at 11/27/11 7:46 PM:
------------------------------------------------------------------

Hello,

I'd like to contribute a test for the parent/child use case with 
CachedSqlEntityProcessor in single- and multi-threaded modes: 
TestThreaded.java.patch, against r1144761.
Please let me know what you think of it.

Actually, I explored this case on 3.4 some time ago, but decided to wait 
until this refactoring had made some progress.

* The first issue is the failure of testCachedSingleThread_FullImport(). It's 
caused by this code:
{code:title=DocBuilder.java line 473}
      } finally {
        entityProcessor.destroy();
      }
{code}
Cleaning up the cache here makes sense, but only for parent entities. For 
child entities it causes a failure during enumeration, when run() is called 
from line :510. It shouldn't be a big deal to fix. 

* Then, a minor complaint: it looks like where="xid=x.id" is not supported by 
the new code, which relies on cachePk="xid" and cacheLookup="x.id". To me 
that's a matter of opinion, backward compatibility, and documentation. 

* The most interesting problem is the failure of 
testCachedMultiThread_FullImport(). On 3.4 it was caused by sharing the child 
entities' iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache 
is accessed by multiple threads from ThreadedEntityProcessorWrapper. I have 
some ideas, but I'd like to hear your opinion first.
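
To make the multi-thread concern concrete, here is a minimal Java sketch. It is 
not the DIH code: SharedRowCache, lookup() and nextRow() are hypothetical names 
chosen for illustration. It assumes the cached rows themselves are not modified 
after they are added, and it keeps the iteration cursor per thread (via 
ThreadLocal) so that one worker cannot advance another worker's iterator.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a child-entity row cache shared by worker threads.
class SharedRowCache {
    // key (e.g. parent id) -> cached child rows; safe for concurrent access
    private final Map<Object, List<Map<String, Object>>> rows = new ConcurrentHashMap<>();

    // Each thread owns its cursor, so iteration state is never shared.
    private final ThreadLocal<Iterator<Map<String, Object>>> cursor = new ThreadLocal<>();

    void add(Object key, Map<String, Object> row) {
        // CopyOnWriteArrayList iterators work on a snapshot, so a late add()
        // cannot invalidate an iteration that is already in progress.
        rows.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(row);
    }

    // Start iterating the rows for one key on the calling thread only.
    void lookup(Object key) {
        List<Map<String, Object>> hit =
            rows.getOrDefault(key, Collections.<Map<String, Object>>emptyList());
        cursor.set(hit.iterator());
    }

    // Next row for the calling thread, or null when its rows are exhausted.
    Map<String, Object> nextRow() {
        Iterator<Map<String, Object>> it = cursor.get();
        if (it == null || !it.hasNext()) {
            cursor.remove();
            return null;
        }
        return it.next();
    }

    public static void main(String[] args) throws InterruptedException {
        SharedRowCache cache = new SharedRowCache();
        for (int parent = 0; parent < 2; parent++) {
            for (int i = 0; i < 1000; i++) {
                Map<String, Object> row = new HashMap<>();
                row.put("xid", parent);
                row.put("n", i);
                cache.add(parent, row);
            }
        }
        Thread[] workers = new Thread[2];
        for (int t = 0; t < workers.length; t++) {
            final int parent = t;
            workers[t] = new Thread(() -> {
                cache.lookup(parent);
                int seen = 0;
                while (cache.nextRow() != null) {
                    seen++;
                }
                // Each worker should report 1000 rows for its own parent key.
                System.out.println("parent " + parent + " saw " + seen + " rows");
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

A per-thread cursor is only one option; giving each worker its own wrapper 
instance around a shared read-only cache would serve the same purpose.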

Guys, I can handle some of these; let me know how I can help.
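
On the where="xid=x.id" point above, one backward-compatible option might be to 
translate the legacy attribute into the new pair. The sketch below is only an 
illustration: LegacyWhere and toCacheAttributes() are hypothetical names, not 
part of the patch, and it assumes the legacy value is a single column=variable 
equality.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps a legacy where="column=variable" attribute
// onto the cachePk / cacheLookup attributes used by the new code.
class LegacyWhere {
    static Map<String, String> toCacheAttributes(String where) {
        if (where == null) {
            throw new IllegalArgumentException("where must not be null");
        }
        int eq = where.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("expected column=variable, got: " + where);
        }
        String pk = where.substring(0, eq).trim();
        String lookup = where.substring(eq + 1).trim();
        if (pk.isEmpty() || lookup.isEmpty()) {
            throw new IllegalArgumentException("expected column=variable, got: " + where);
        }
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("cachePk", pk);          // column in the cached (child) entity
        attrs.put("cacheLookup", lookup);  // variable resolved from the parent row
        return attrs;
    }

    public static void main(String[] args) {
        // prints {cachePk=xid, cacheLookup=x.id}
        System.out.println(toCacheAttributes("xid=x.id"));
    }
}
```

Whether to accept the old attribute silently or only document the new pair is 
the matter of opinion mentioned above.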

--
Mikhail
                
> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
>     - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>    
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are near completion in converting our search functionality from a legacy 
> search engine to Solr.  However, I found that DIH did not support caching to 
> the level of our prior product's data import utility.  In order to get our 
> data into Solr, I created these caching enhancements.  Because I believe this 
> has broad application, and because we would like this feature to be supported 
> by the Community, I have front-ported this, enhanced, to Trunk.  I have also 
> added unit tests and verified that all existing test cases pass.  I believe 
> this patch maintains backwards-compatibility and would be a welcome addition 
> to a future version of Solr.
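
The change to destroy() semantics in item 6, and the single-thread failure 
reported in the comment, both come down to when a processor's cache is 
released. The following is a minimal sketch with hypothetical classes 
(ChildProcessor and DestroyDemo are not DIH classes). It assumes a child entity 
whose cache is built lazily and compares destroying it after every parent row 
with destroying it once at the end. In this toy version an early destroy() only 
forces a reload; the comment reports an outright test failure in the real code 
path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical child entity processor with a lazily built in-memory cache.
class ChildProcessor {
    private Map<Integer, List<String>> cache; // null until first use and after destroy()
    int loads = 0;                            // how often the "expensive" load ran

    List<String> rowsFor(int parentId) {
        if (cache == null) {
            cache = load();
        }
        return cache.getOrDefault(parentId, Collections.<String>emptyList());
    }

    private Map<Integer, List<String>> load() {
        loads++;
        Map<Integer, List<String>> m = new HashMap<>();
        m.computeIfAbsent(1, k -> new ArrayList<>()).add("child-a");
        m.computeIfAbsent(1, k -> new ArrayList<>()).add("child-b");
        m.computeIfAbsent(2, k -> new ArrayList<>()).add("child-c");
        return m;
    }

    // One-time cleanup: drop the cache.
    void destroy() {
        cache = null;
    }
}

class DestroyDemo {
    // destroyPerRow = true mimics calling destroy() after every parent row;
    // false mimics a single cleanup once all parent rows are processed.
    static int runParentRows(ChildProcessor child, int[] parentIds, boolean destroyPerRow) {
        int joined = 0;
        try {
            for (int id : parentIds) {
                joined += child.rowsFor(id).size();
                if (destroyPerRow) {
                    child.destroy();
                }
            }
        } finally {
            child.destroy(); // final cleanup happens exactly once here
        }
        return joined;
    }

    public static void main(String[] args) {
        ChildProcessor once = new ChildProcessor();
        runParentRows(once, new int[] {1, 2}, false);
        ChildProcessor perRow = new ChildProcessor();
        runParentRows(perRow, new int[] {1, 2}, true);
        // prints loads: once=1 perRow=2
        System.out.println("loads: once=" + once.loads + " perRow=" + perRow.loads);
    }
}
```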
