[
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157979#comment-13157979
]
Mikhail Khludnev edited comment on SOLR-2382 at 11/27/11 7:46 PM:
------------------------------------------------------------------
Hello,
I would like to contribute a test for the parent/child use case with
CachedSqlEntityProcessor in single- and multi-threaded modes:
TestThreaded.java.patch, made against r1144761.
Please let me know what you think of it.
I actually explored this case on 3.4 some time ago, but decided to wait until
this refactoring had made some progress.
* The first issue is the failure of testCachedSingleThread_FullImport(). It is
caused by this code:
{code:title=DocBuilder.java line 473}
} finally {
  entityProcessor.destroy();
}
{code}
The destroy() call, which cleans up the cache, makes sense, but only for parent
entities. It breaks the enumeration of child entities when run() is called
from line 510. It shouldn't be a big deal to fix.
* Then, a minor complaint: it looks like where="xid=x.id" is not supported by
the new code, which relies on cachePk="xid" and cacheLookup="x.id" instead. To
me this is a matter of opinion, backward compatibility and documentation.
* The most interesting problem is the failure of
testCachedMultiThread_FullImport(). On 3.4 it is caused by sharing the child
entities' iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache
is accessed by multiple threads from ThreadedEntityProcessorWrapper. I have
some ideas, but would like to know your opinion.
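
To illustrate the last point, here is a minimal sketch of why shared iteration state breaks once there is more than one consumer, and of one possible way around it. The class and method names below are hypothetical stand-ins, not the real DIH classes, and the per-thread cursor is only one assumed option, not necessarily the fix that fits DIHCacheSupport.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT the real DIH classes.
class SharedIterationState {

    /** One cursor shared by every caller: the pattern that breaks with several consumers. */
    static class SharedCursor {
        private final Iterator<String> cursor;
        SharedCursor(List<String> rows) { this.cursor = rows.iterator(); }
        String nextRow() { return cursor.hasNext() ? cursor.next() : null; }
    }

    /** One cursor per thread over the same read-only rows. */
    static class PerThreadCursor {
        private final ThreadLocal<Iterator<String>> cursor;
        PerThreadCursor(List<String> rows) {
            this.cursor = ThreadLocal.withInitial(rows::iterator);
        }
        String nextRow() {
            Iterator<String> it = cursor.get();
            return it.hasNext() ? it.next() : null;
        }
    }

    /** Two consumers take turns on one shared cursor; returns what each of them saw. */
    static List<List<String>> interleaveShared(List<String> rows) {
        SharedCursor shared = new SharedCursor(rows);
        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();
        while (true) {
            String rowA = shared.nextRow();
            if (rowA == null) break;
            a.add(rowA);
            String rowB = shared.nextRow();
            if (rowB == null) break;
            b.add(rowB);
        }
        return Arrays.asList(a, b);
    }

    /** Drains the per-thread cursor on a fresh thread and returns the rows that thread saw. */
    static List<String> drainOnNewThread(PerThreadCursor cursor) {
        List<String> seen = new ArrayList<>();
        Thread t = new Thread(() -> {
            for (String row = cursor.nextRow(); row != null; row = cursor.nextRow()) {
                seen.add(row);
            }
        });
        t.start();
        try {
            t.join(); // join() makes the worker's writes to 'seen' visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("child-1", "child-2", "child-3");
        System.out.println(interleaveShared(rows));
        PerThreadCursor perThread = new PerThreadCursor(rows);
        System.out.println(drainOnNewThread(perThread));
        System.out.println(drainOnNewThread(perThread));
    }
}
```

With the shared cursor, each of the two consumers sees only part of the child rows. With the per-thread cursor, every thread sees all of them. The interleaving is simulated on a single thread on purpose, so the lost rows are reproducible rather than dependent on timing.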
Guys, I can handle some of these; let me know how I can help.
--
Mikhail
> DIH Cache Improvements
> ----------------------
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Reporter: James Dyer
> Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-properties.patch, SOLR-2382-properties.patch,
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch,
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, TestThreaded.java.patch
>
>
> Functionality:
> 1. Provide a pluggable caching framework for DIH so that users can choose a
> cache implementation that best suits their data and application.
>
> 2. Provide a means to temporarily cache a child Entity's data without
> needing to create a special cached implementation of the Entity Processor
> (such as CachedSqlEntityProcessor).
>
> 3. Provide a means to write the final (root entity) DIH output to a cache
> rather than to Solr. Then provide a way for a subsequent DIH call to use the
> cache as an Entity input. Also provide the ability to do delta updates on
> such persistent caches.
>
> 4. Provide the ability to partition data across multiple caches that can
> then be fed back into DIH and indexed either to varying Solr Shards, or to
> the same Core in parallel.
> Use Cases:
> 1. We needed a flexible & scalable way to temporarily cache child-entity
> data prior to joining to parent entities.
> - Using SqlEntityProcessor with Child Entities can cause an "n+1 select"
> problem.
> - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching
> mechanism and does not scale.
> - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>
> 2. We needed the ability to gather data from long-running entities by a
> process that runs separately from our main indexing process.
>
> 3. We wanted the ability to do a delta import of only the entities that
> changed.
> - Lucene/Solr requires entire documents to be re-indexed, even if only a
> few fields changed.
> - Our data comes from 50+ complex sql queries and/or flat files.
> - We do not want to incur overhead re-gathering all of this data if only 1
> entity's data changed.
> - Persistent DIH caches solve this problem.
>
> 4. We want the ability to index several documents in parallel (using 1.4.1,
> which did not have the "threads" parameter).
>
> 5. In the future, we may need to use Shards, creating a need to easily
> partition our source data into Shards.
> Implementation Details:
> 1. De-couple EntityProcessorBase from caching.
> - Created a new interface, DIHCache & two implementations:
> - SortedMapBackedCache - An in-memory cache, used as default with
> CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested
> with je-4.1.6.jar
> - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.
> I believe this may be incompatible due to Generic Usage.
> - NOTE: I did not modify the ant script to automatically get this jar,
> so to use or evaluate this patch, download bdb-je from
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
>
> 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>
> 3. Partially De-couple SolrWriter from DocBuilder
> - Created a new interface DIHWriter, & two implementations:
> - SolrWriter (refactored)
> - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
> 4. Create a new Entity Processor, DIHCacheProcessor, which reads a
> persistent Cache as DIH Entity Input.
>
> 5. Support a "partition" parameter with both DIHCacheWriter and
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>
> 6. Change the semantics of entity.destroy()
> - Previously, it was being called on each iteration of
> DocBuilder.buildDocument().
> - Now it does one-time cleanup tasks (like closing or deleting a
> disk-backed cache) once the entity processor is completed.
> - The only out-of-the-box entity processor that previously implemented
> destroy() was LineEntityProcessor, so this is not a very invasive change.
> General Notes:
> We are near completion in converting our search functionality from a legacy
> search engine to Solr. However, I found that DIH did not support caching to
> the level of our prior product's data import utility. In order to get our
> data into Solr, I created these caching enhancements. Because I believe this
> has broad application, and because we would like this feature to be supported
> by the Community, I have front-ported this, enhanced, to Trunk. I have also
> added unit tests and verified that all existing test cases pass. I believe
> this patch maintains backwards-compatibility and would be a welcome addition
> to a future version of Solr.
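
The "n+1 select" problem and the keyed cache lookup mentioned in the description above can be sketched as follows. This is a minimal illustration under stated assumptions: the class and helper names are hypothetical, a plain HashMap stands in for whatever cache implementation is plugged in, and nothing here is the DIHCache interface from the patch. The idea is that the child rows are read once and grouped by their key column (in the spirit of cachePk="xid"), and each parent row then does an in-memory lookup with its own key (in the spirit of cacheLookup="x.id") instead of issuing one child query per parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the DIHCache interface or any class from the patch.
class CachedChildJoin {

    /** Reads the child rows once and groups them by the given key column. */
    static Map<Object, List<Map<String, Object>>> buildCache(
            List<Map<String, Object>> childRows, String cachePk) {
        Map<Object, List<Map<String, Object>>> cache = new HashMap<>();
        for (Map<String, Object> row : childRows) {
            cache.computeIfAbsent(row.get(cachePk), k -> new ArrayList<>()).add(row);
        }
        return cache;
    }

    /** Returns the child rows for one parent key, or an empty list if there are none. */
    static List<Map<String, Object>> lookup(
            Map<Object, List<Map<String, Object>>> cache, Object parentKey) {
        return cache.getOrDefault(parentKey, Collections.emptyList());
    }

    /** Small helper that builds a two-column row. */
    static Map<String, Object> row(String k1, Object v1, String k2, Object v2) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put(k1, v1);
        r.put(k2, v2);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> children = Arrays.asList(
                row("xid", 1, "desc", "a"),
                row("xid", 1, "desc", "b"),
                row("xid", 2, "desc", "c"));
        // One pass over the child data, instead of one query per parent row.
        Map<Object, List<Map<String, Object>>> cache = buildCache(children, "xid");
        for (int parentId : new int[] {1, 2, 3}) {
            System.out.println(parentId + " -> " + lookup(cache, parentId));
        }
    }
}
```

A disk-backed implementation would keep the same build-once, look-up-many shape and only change where the grouped rows are stored.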