[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-12-02 Thread James Dyer (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161726#comment-13161726
]

James Dyer commented on SOLR-2382:
--

Noble,

I have attached a patch with a corrected unit test and fix on SOLR-2933,
addressing one of the problems Mikhail described. Indeed, the "where" parameter
was broken by our last commit, and TestCachedSqlEntityProcessor masked the
failure and passed anyway. Would you mind looking at my patch and committing
it? Thanks.

DIH Cache Improvements
--

Key: SOLR-2382
URL: https://issues.apache.org/jira/browse/SOLR-2382
Project: Solr
Issue Type: New Feature
Components: contrib - DataImportHandler
Reporter: James Dyer
Priority: Minor
Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch,
SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
SOLR-2382-entities.patch, SOLR-2382-entities.patch,
SOLR-2382-properties.patch, SOLR-2382-properties.patch,
SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch,
SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch,
SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
TestCachedSqlEntityProcessor.java-break-where-clause.patch,
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch,
TestThreaded.java.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a
cache implementation that best suits their data and application.

2. Provide a means to temporarily cache a child Entity's data without
needing to create a special cached implementation of the Entity Processor
(such as CachedSqlEntityProcessor).

3. Provide a means to write the final (root entity) DIH output to a cache
rather than to Solr. Then provide a way for a subsequent DIH call to use the
cache as an Entity input. Also provide the ability to do delta updates on
such persistent caches.

4. Provide the ability to partition data across multiple caches that can
then be fed back into DIH and indexed either to varying Solr Shards, or to
the same Core in parallel.
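
As a rough illustration of item 4, partitioning could be as simple as hashing
each row's key into one of N buckets. This is only a sketch: the class
PartitionSketch and its methods are hypothetical names invented here, not part
of the patch, and the patch may well partition differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class PartitionSketch {
    // Hypothetical helper: pick a partition for a row by hashing its key.
    // floorMod keeps the result in [0, partitions) even for negative hashes.
    public static int partitionFor(Object key, int partitions) {
        return Math.floorMod(Objects.hashCode(key), partitions);
    }

    // Split rows into `partitions` buckets based on the value of keyField.
    public static List<List<Map<String, Object>>> split(
            List<Map<String, Object>> rows, String keyField, int partitions) {
        List<List<Map<String, Object>>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Map<String, Object> row : rows) {
            buckets.get(partitionFor(row.get(keyField), partitions)).add(row);
        }
        return buckets;
    }
}
```

Each bucket could then be written to its own cache and indexed separately,
either to a different shard or to the same core in parallel.
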

Use Cases:
1. We needed a flexible scalable way to temporarily cache child-entity
data prior to joining to parent entities.
- Using SqlEntityProcessor with Child Entities can cause an n+1 select
problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching
mechanism and does not scale.
- There is no way to cache non-SQL inputs (ex: flat files, xml, etc).

2. We needed the ability to gather data from long-running entities by a
process that runs separate from our main indexing process.

3. We wanted the ability to do a delta import of only the entities that
changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a
few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1
entity's data changed.
- Persistent DIH caches solve this problem.

4. We want the ability to index several documents in parallel (using 1.4.1,
which did not have the threads parameter).

5. In the future, we may need to use Shards, creating a need to easily
partition our source data into Shards.
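
To make the n+1 select problem from use case 1 concrete, here is a small
in-memory sketch. It is not DIH code: ChildSource, buildCache and childrenOf
are hypothetical names, and the "queries" counter merely stands in for round
trips to a real data source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CachedJoinSketch {
    // Hypothetical stand-in for a child-entity data source; counts its queries.
    public static class ChildSource {
        private final List<Map<String, Object>> rows;
        public int queries = 0;

        public ChildSource(List<Map<String, Object>> rows) {
            this.rows = rows;
        }

        // One query per parent row: this is the n+1 pattern.
        public List<Map<String, Object>> selectByParent(String keyField, Object parentId) {
            queries++;
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : rows) {
                if (parentId.equals(row.get(keyField))) {
                    out.add(row);
                }
            }
            return out;
        }

        // One query for everything, used to fill the cache.
        public List<Map<String, Object>> selectAll() {
            queries++;
            return rows;
        }
    }

    // Read the child source once and group its rows by the join key.
    public static Map<Object, List<Map<String, Object>>> buildCache(
            ChildSource source, String keyField) {
        Map<Object, List<Map<String, Object>>> cache = new HashMap<>();
        for (Map<String, Object> row : source.selectAll()) {
            cache.computeIfAbsent(row.get(keyField), k -> new ArrayList<>()).add(row);
        }
        return cache;
    }

    // Join step: a map lookup instead of another query.
    public static List<Map<String, Object>> childrenOf(
            Map<Object, List<Map<String, Object>>> cache, Object parentId) {
        return cache.getOrDefault(parentId, Collections.emptyList());
    }
}
```

With the cache, the child source is read once no matter how many parent rows
there are. The HashMap here has the same scaling limit described above for
CachedSqlEntityProcessor, which is the motivation for pluggable, disk-backed
implementations.
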

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, with two implementations:
- SortedMapBackedCache - An in-memory cache, used as the default with
CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested
with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.
I believe this may be incompatible due to the use of generics.
- NOTE: I did not modify the ant script to automatically get this jar,
so to use or evaluate this patch, download bdb-je from
http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html

2. Allow Entity Processors to take a cacheImpl parameter to cause the
entity data to be cached (see EntityProcessorBase DIHCacheProperties).

3. Partially De-couple SolrWriter from DocBuilder
- Created a new interface, DIHWriter, with two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

4. Create a new Entity
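
The split in item 1 between a cache interface and interchangeable
implementations might look roughly like the sketch below. The names RowCache
and SortedMapRowCache, and the three methods shown, are assumptions made up for
illustration; the actual DIHCache method signatures are not given in this
thread and may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class CacheSketch {
    // Hypothetical interface; not the real DIHCache contract.
    public interface RowCache<K> {
        void add(K key, Map<String, Object> row);

        List<Map<String, Object>> lookup(K key);

        void close();
    }

    // In-memory implementation backed by a sorted map, keyed on the join column.
    public static class SortedMapRowCache<K extends Comparable<K>> implements RowCache<K> {
        private final SortedMap<K, List<Map<String, Object>>> data = new TreeMap<>();

        @Override
        public void add(K key, Map<String, Object> row) {
            // Several rows may share one key, so each key maps to a list.
            data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }

        @Override
        public List<Map<String, Object>> lookup(K key) {
            List<Map<String, Object>> rows = data.get(key);
            return rows == null ? Collections.emptyList() : Collections.unmodifiableList(rows);
        }

        @Override
        public void close() {
            // Nothing is persisted in this sketch, so closing discards the data.
            data.clear();
        }
    }
}
```

A disk-backed implementation would implement the same interface, so an entity
processor written against the interface would not need to know which one a
setting such as cacheImpl had selected.
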

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-30 Thread Mikhail Khludnev (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160683#comment-13160683
]

Mikhail Khludnev commented on SOLR-2382:
--

I spawned subtask SOLR-2933.

DIH Cache Improvements
--

4. Create a new Entity Processor, DIHCacheProcessor, which reads a
persistent Cache as DIH Entity Input.

5. Support a partition parameter with both DIHCacheWriter and
DIHCacheProcessor to allow for easy partitioning of source entity data.

6. Change the semantics of entity.destroy()
- Previously,

[jira] [Commented] (SOLR-2382) DIH Cache Improvements

2011-11-30 Thread Noble Paul (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160688#comment-13160688
]

Noble Paul commented on SOLR-2382:
--

@James
Yes, create a new issue for all further functionality, and let's close this
one.