[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2012-03-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Fix Version/s: 3.6

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Assignee: James Dyer
Priority: Minor
 Fix For: 3.6, 4.0

 Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
 SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
 SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 TestCachedSqlEntityProcessor.java-break-where-clause.patch, 
 TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
  
 TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch,
  TestThreaded.java.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
 1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
 2. We needed the ability to gather data from long-running entities by a 
 process that runs separately from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache, & two implementations (a rough interface sketch follows after this list):  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.
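
To make Implementation Detail 1 a little more concrete for readers of this thread, here is a rough Java sketch of the kind of pluggable cache interface being described. It is an illustration only: the method set and signatures below are my assumptions about such an interface's shape, not necessarily what the attached patches define (the DIH Context type is assumed for initialization).

{code:title=Illustrative sketch only - not necessarily the patch's actual interface}
import java.util.Iterator;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;

/** Rows are Map<String,Object> keyed by a primary-key value, as DIH already models entity rows. */
public interface DIHCache extends Iterable<Map<String, Object>> {
  void open(Context context);                          // initialize, e.g. open a disk-backed store
  void add(Map<String, Object> row);                   // append one row under its key
  Iterator<Map<String, Object>> iterator(Object key);  // all rows previously stored for a key
  void delete(Object key);                             // delta updates: drop one key's rows
  void deleteAll();                                    // clear the whole cache
  void flush();                                        // persist pending writes
  void close();                                        // release resources but keep the data
  void destroy();                                      // one-time cleanup, delete backing data
}
{code}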

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2012-03-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382_3x.patch

The patch for 3.x includes everything already committed to Trunk as well as various 
bug fixes (also in Trunk already).  (The patch is here for reference only; the changes 
were actually moved using svn merge.)

I will commit to 3.x shortly.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-29 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: 
TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch

TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch
 breaks withKeyAndLookup() by reordering the entries in the row map with a LinkedHashMap 
(please have a look at createLinkedMap()).

The stacktrace is pretty much the same as in the comment above. You can check the 
failed instance of SortedMapBackedCache in a debugger: it has desc as the 
primaryKeyName and a messed-up theMap:
{code} {another another three=[{id=3, desc=another another three}], another 
three=[{id=3, desc=another three}], another two=[{id=2, desc=another two}], 
one=[{desc=one, id=1}], three=[{id=3, desc=three}], two=[{id=2, desc=two}]}
{code}
Regards
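
(For readers following the thread: the ordering dependence Mikhail describes is easy to reproduce outside DIH. The snippet below is illustrative only and is not DIH code; it just shows how a "first key in the row map" fallback picks a different primary-key column once the same row is wrapped in a sorted map.)

{code:title=Illustrative only - why a "first key wins" fallback breaks under a sorted map}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FirstKeyPitfall {
  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 3);                      // insertion order: id comes first
    row.put("desc", "another three");

    Map<String, Object> sortedRow = new TreeMap<>(row);  // alphabetical order: desc comes first

    // A fallback that treats "the first key" as the primary key gets different answers
    // depending on which Map implementation the row happens to be stored in.
    System.out.println(row.keySet().iterator().next());        // prints: id
    System.out.println(sortedRow.keySet().iterator().next());  // prints: desc
  }
}
{code}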

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is an updated dihwriter patch, fixing a test bug.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter_standalone.patch

{quote}
Can you achieve your objective by just adding an implementation? I mean will 
you need to patch DIH to meet your needs?
{quote}
This patch, SOLR-2382-dihwriter_standalone.patch, separates the parameter 
names so that it is entirely stand-alone.  So yes, for my purposes, we can put 
this code in a separate .jar and it will not need any patches to DIH.

{quote}
My question is, is it relevant to a large number of people to have this last patch?
{quote}
Anyone who is indexing a lot of data that needs to join from multiple sources 
would probably consider using this, provided it gets well-documented.  In our 
case, we have different data sources across our enterprise that all contribute 
a few fields to each Solr document, so having this added flexibility was vital 
for us.

In any case, if you do not want to work towards committing this last patch now 
I think it would be wise to close SOLR-2382 and I can open a new case just for 
this last patch.  That'll make it easier for someone who wants the 
functionality in the future to find it in JIRA, or if some other committer 
wants to pick it up someday.  Let me know what you plan to do.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch
TestCachedSqlEntityProcessor.java-break-where-clause.patch

James,

please find my proof of the absence of where=xid=x.id support: 
TestCachedSqlEntityProcessor.java-break-where-clause.patch. It looks puzzling - 
I'm sorry for that. The test was green only because it relied on the key order in 
the map; wrapping the map in a sorted map breaks that order and leads to picking 
the wrong primary-key column. Please find the explanation below.

From my point of view, the cruelest part is 
[lines 27-28|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]:
 it picks up just the first key of the map as the primary key when one wasn't 
properly detected from the attributes. This fallback hides the problem until you 
face it and have to address it.

The left part of the where clause isn't used [here at lines 
45-48|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup]
 and where= is ignored again [at lines 
185-190|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup].

You can see that the second attachment, 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
 fixes the test by adding cachePk and lookup to the attributes.

My proposals are:
* Fix it; it's not a big deal to bring the where attribute back.
* But are the new cachePk and cacheLookup attributes really better than the old where 
attribute? Depending on the answer, I vote either for
** decommissioning where=, or for 
** rolling the new cachePk/cacheLookup attributes back.
* Can't we add more randomization to 
AbstractDataImportHandlerTestCase.createMap(Object...) to find more similar 
hidden issues? I propose using a concrete map behaviour at random: hash, sorted, 
sorted-reverse (a quick sketch follows below). WDYT?
* The names withWhereClause() and withKeyAndLookup() should be swapped; their 
contents contradict [the 
names|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?view=markup]:
{code}
  public void withWhereClause() {
    ...
    query, q, DIHCacheSupport.CACHE_PRIMARY_KEY, id,
    DIHCacheSupport.CACHE_FOREIGN_KEY,
    ...
  public void withKeyAndLookup() {
    ...
    Map<String, String> entityAttrs = createMap(query, q, where, id=x.id,
    ...
{code}
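
(A minimal sketch of the randomization idea from the proposal above. The helper name mirrors AbstractDataImportHandlerTestCase.createMap(Object...), but this standalone version is only my assumption of how such randomization could look, not existing test code.)

{code:title=Sketch of a randomized createMap(Object...) helper - assumption, not existing code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class RandomizedMaps {
  private static final Random RANDOM = new Random();

  /** Builds a map from key/value pairs, randomly varying the backing implementation. */
  public static Map<String, Object> createMap(Object... args) {
    Map<String, Object> map;
    switch (RANDOM.nextInt(3)) {
      case 0:  map = new HashMap<>(); break;                            // hash order
      case 1:  map = new TreeMap<>(); break;                            // sorted
      default: map = new TreeMap<>(Collections.reverseOrder()); break;  // sorted-reverse
    }
    for (int i = 0; i < args.length; i += 2) {
      map.put((String) args[i], args[i + 1]);
    }
    return map;
  }
}
{code}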


[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-27 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: TestThreaded.java.patch

Hello,

I want to contribute a test for the parent/child use case with 
CachedSqlEntityProcessor in single- and multi-threaded modes: 
TestThreaded.java.patch, against r1144761.
Please let me know how you feel about it.

Actually, I explored this case on 3.4 some 
time ago, but decided to wait a little until this refactoring made some progress.

* The first issue is the testCachedSingleThread_FullImport() failure. It's caused 
by 
{code:title=DocBuilder.java line 473}
   } finally {
entityProcessor.destroy();
  }
{code}
This code, which cleans up the cache, makes sense for parent entities 
only; it causes a failure in the child-entity enumeration when run() is 
called from line 510. It shouldn't be a big deal to fix (a toy illustration 
appears at the end of this comment).

* Then, some minor moaning: it looks like where=xid=x.id is not supported by the new 
code, which relies on cachePk=xid and cacheLookup=x.id. For me it's a 
matter of opinion, backward compatibility, and documentation. 

* The most interesting problem is the failure of 
testCachedMultiThread_FullImport(). On 3.4 it's caused by shared child-entity 
iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache 
is accessed by multiple threads from ThreadedEntityProcessorWrapper. I have 
some ideas, but I want to know your opinion.

Guys, I can handle some of these; let me know how I can help.

--
Mikhail
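
(An editorial aside on the first point above: the effect is easy to see in miniature. The toy below is not DIH code and all names in it are made up; it just shows why clearing a shared child cache inside the per-parent loop, as the quoted finally block effectively does, leaves nothing for the second parent row, while deferring the cleanup until after the loop matches the one-time destroy() semantics described in Implementation Detail 6 of this issue.)

{code:title=Toy illustration only - not DIH code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DestroyTooEarly {
  // Stands in for a cached child entity, keyed by the parent's foreign key.
  static Map<String, List<String>> childCache = new HashMap<>();

  static void destroy() { childCache.clear(); }  // analogous to entityProcessor.destroy()

  public static void main(String[] args) {
    childCache.put("p1", Arrays.asList("c1", "c2"));
    childCache.put("p2", Arrays.asList("c3"));

    for (String parent : Arrays.asList("p1", "p2")) {
      try {
        System.out.println(parent + " -> " + childCache.get(parent));
      } finally {
        destroy();  // per-iteration cleanup: the second parent prints "p2 -> null"
      }
    }
    // Calling destroy() here instead (once, after all parents are processed)
    // is the one-time-cleanup behaviour the refactoring aims for.
  }
}
{code}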

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-14 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is an updated version of the dihwriter patch, which is the last part of 
this ticket.  This is in sync with Noble's commit of the entities patch.

This includes:
1. DIHCacheWriter - lets DIH write to a DIHCache instead of to Solr for future 
processing.  This also supports Delta Updates on caches.
2. DIHCacheProcessor - lets DIH read from a cache that was previously  written 
by DIHCacheWriter.
3. MockDIHCache - a bare-bones cache implementation that supports persistence 
and delta updates.  This allows us to run all of the unit tests without also 
needing to apply SOLR-2613 (BerkleyBackedCache).  This also provides a full 
reference implementation for those who would like to write their own Caches.

Noble, will you be able to work through this patch also with me?
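
(For readers who want to picture what a bare-bones cache along the lines of item 3 might look like, here is a standalone toy sketch. It is my own illustration: the class and method names are assumptions, it implements no actual DIH interface, and it is not the MockDIHCache from the patch. It simply stores rows keyed by a primary-key column, with per-key lookup and delete to support delta-style updates.)

{code:title=Standalone toy sketch - not the MockDIHCache from the patch}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyRowCache {
  // key value -> all rows sharing that key; keys are assumed Comparable (e.g. String or Integer)
  private final TreeMap<Object, List<Map<String, Object>>> rowsByKey = new TreeMap<>();
  private final String primaryKeyName;

  public ToyRowCache(String primaryKeyName) {
    this.primaryKeyName = primaryKeyName;
  }

  public void add(Map<String, Object> row) {
    Object key = row.get(primaryKeyName);
    rowsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  public Iterator<Map<String, Object>> lookup(Object key) {
    return rowsByKey.getOrDefault(key, Collections.emptyList()).iterator();
  }

  public void delete(Object key) {   // delta updates: drop (or later replace) one key's rows
    rowsByKey.remove(key);
  }

  public void deleteAll() {
    rowsByKey.clear();
  }
}
{code}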

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-27 Thread Noble Paul (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

With some cleanup. 
I think there is a very big omission: the EntityProcessorBase.transformers 
field is not used in the latest patch. How does transformation work?

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is a version of the dihwriter patch compatible with today's entity 
patch update.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Re-attaching the entities patch with the grant-license button selected.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-17 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Noble,

Here is a version of the entities patch using .iterator() methods as you 
suggest.  Let me know if this is what you had in mind and also if there is 
anything else you'd like to address.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-08-05 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch
SOLR-2382-entities.patch

Updating the patches to the current Trunk.  To use these, apply entities 
first, then dihwriter.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-29 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter-verbose-fix.patch

This patch fixes verbose debugging output and can be applied to the current 
Trunk.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Here is a freshly-sync'ed version of the entities patch.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-21 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

Here is the solrwriter patch, sync'ed to the latest trunk, which now has the 
first patch (properties) committed...

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-14 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

This is the DIHCacheWriter and DIHCacheProcessor portion taken from the 
June 24 version of the entire patch. This patch depends both on the 
properties patch and the solrwriter patch.

DIHCacheWriter is a drop-in replacement for SolrWriter, allowing a DIH run to 
write its output to a DIHCache rather than to Solr.  Users can specify which 
cacheImpl to use.  If the cache supports persistence, delta updates can be 
performed on the cached data.

DIHCacheProcessor is an EntityProcessor that takes a DIHCache as its input.

Unit tests are included.  However, most tests are skipped because 
SortedMapBackedCache does not support persistence.  To get the tests to run 
(for now), we also need to apply SOLR-2613 (BerkleyBackedCache).
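
To make the write-then-read idea concrete, here is a purely illustrative sketch 
(class and method names are invented, and the in-memory list stands in for a 
persistent store) of the round trip: one DIH run writes its finished rows to a 
cache instead of Solr, and a later run reads them back as entity input.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the round trip this patch enables.
class CacheRoundTripSketch {
  // Stand-in for a persistent cache; a real one would live on disk.
  private final List<Map<String, Object>> store = new ArrayList<>();

  // Writer side (what DIHCacheWriter does conceptually): accept finished documents.
  void write(Map<String, Object> document) {
    store.add(document);
  }

  // Reader side (what DIHCacheProcessor does conceptually): hand the same rows
  // back to DIH one at a time, so they can be indexed later or in parallel.
  Iterator<Map<String, Object>> readBack() {
    return store.iterator();
  }
}
{code}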

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-13 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

The patch does not apply.  The codebase has changed in the trunk, but it still 
does not apply.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-13 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

This is the Entity Processor Caching portion taken from the June 24 version of 
the entire patch.  This patch depends both on the properties patch and the 
solrwriter patch.

This gives users the ability to specify a cacheImpl parameter to any Entity 
Processor, adding caching functionality.  CachedSqlEntityProcessor is gutted 
and deprecated (with no loss of functionality).  Instead, users can now use 
SqlEntityProcessor and specify a cacheImpl.  Also, the caching functionality 
has been removed from EntityProcessorBase and placed into SortedMapBackedCache, 
which is the only Cache implementation available here (BerkleyBackedCache has 
been moved to SOLR-2613).

Two new unit tests are included in this patch, one for SortedMapBackedCache and 
another demonstrating the use of an ephemeral cache to do joins.
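
For orientation, here is a stripped-down sketch (not the committed 
SortedMapBackedCache class) of what a sorted-map-backed cache amounts to: rows 
grouped under a key column, so child-entity joins can be served from memory 
instead of re-querying per parent row (avoiding the n+1 select problem).

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of an in-memory, sorted-map-backed row cache.
class SortedMapCacheSketch {
  private final String keyColumn;
  // Keys are assumed Comparable (a TreeMap requirement); typical DIH keys are.
  private final SortedMap<Object, List<Map<String, Object>>> data = new TreeMap<>();

  SortedMapCacheSketch(String keyColumn) {
    this.keyColumn = keyColumn;
  }

  // Store one row under its key-column value.
  void add(Map<String, Object> row) {
    Object key = row.get(keyColumn);
    data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  // All cached rows whose key column equals the given value.
  Iterator<Map<String, Object>> lookup(Object key) {
    List<Map<String, Object>> rows = data.get(key);
    return rows == null ? Collections.<Map<String, Object>>emptyIterator() : rows.iterator();
  }
}
{code}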

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-properties.patch

This is the properties file writer abstraction taken from the June 24 version 
of the entire patch.

This removes property file operations from SolrWriter, paving the way for 
multiple DIHWriter implementations (such as DIHCacheWriter).
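
A minimal sketch of that separation, with invented names (the patch's actual 
classes and signatures may differ): the import-run bookkeeping, e.g. 
last_index_time, sits behind its own small interface so that any DIHWriter 
implementation can reuse it rather than SolrWriter owning the property file.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: persistence of DIH bookkeeping pulled out of the writer.
interface DIHPropertiesSketch {
  Map<String, Object> read();              // load previously persisted properties
  void persist(Map<String, Object> props); // write them back after a successful run
}

// A trivial in-memory implementation for illustration; a real one would read
// and write dataimport.properties in the conf directory.
class InMemoryPropertiesSketch implements DIHPropertiesSketch {
  private Map<String, Object> stored = new HashMap<>();

  @Override
  public Map<String, Object> read() {
    return new HashMap<>(stored);
  }

  @Override
  public void persist(Map<String, Object> props) {
    stored = new HashMap<>(props);
  }
}
{code}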

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-properties.patch

The previous patch left out a couple of things.  Here is a fixed version...

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

This is the DIHWriter abstraction taken from the June 24 version of the 
entire patch.  This patch depends on SOLR-2382-properties.patch.

SolrWriter now implements the new DIHWriter interface.  Also, logging 
operations are removed from SolrWriter.  This opens the possibility of creating 
pluggable DIHWriters so that DIH can write to something other than a Solr index. 
 This in turn opens up the ability to create a Writer that can write to a 
Persistent Cache for later use.
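
The writer seam described above might look roughly like this; the method names 
below are assumptions, not the actual DIHWriter signatures from the patch.

{code:java}
import java.util.Map;

// Sketch: the writer-side seam between DocBuilder and its output target.
// DocBuilder talks only to this interface; one implementation sends documents
// to Solr, another appends the same rows to a persistent cache.
interface DIHWriterSketch {
  void init(Map<String, Object> requestParams); // per-run setup (hypothetical signature)
  boolean upload(Map<String, Object> document); // write one finished document/row
  void deleteByKey(Object key);                 // delta support: drop stale entries
  void commit(boolean optimize);                // finish the run
  void rollback();                              // abandon the run
  void close();                                 // release resources
}
{code}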

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-24 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Here is a version that passes parameters via the Context object, rather than by 
building maps.
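
A small sketch of what passing parameters via the Context looks like in 
practice; Context.getResolvedEntityAttribute() is an existing DIH method, and 
cacheImpl is the entity parameter this issue introduces.

{code:java}
import org.apache.solr.handler.dataimport.Context;

// Sketch: reading per-entity configuration straight off the Context instead of
// assembling and passing around separate parameter maps.
class CacheConfigReaderSketch {

  // Returns the cache implementation configured on the entity,
  // e.g. cacheImpl="SortedMapBackedCache" in data-config.xml, or null if unset.
  static String resolveCacheImpl(Context context) {
    return context.getResolvedEntityAttribute("cacheImpl");
  }
}
{code}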

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Just found a little bug in SortedMapBackedCache.  This patch version includes a 
fix for it.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Sorry...that last patch included some unrelated code.  This one is correct.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: (was: SOLR-2382.patch)

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-21 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Here's a new patch:

- Applies cleanly to latest trunk.
- Better de-coupling of SolrWriter and the PropertiesWriter
- BerkleyBackedCache removed (will open a separate issue for this).
- Unit test enhancements to make it easy to unit test new cache implementations 
as they get added.

Note that because SortedMapBackedCache does not support persistence, most of 
the features are untestable (one test was removed...The others just skip for 
now).  Three possible solutions:

- create a persistence option for SortedMapBackedCache (maybe just use Java 
serialization; a rough sketch follows this list).
- create a new cache impl that doesn't depend on an incompatibly-licensed 
product (maybe Lucene-backed, although I tried this long ago and didn't get the 
performance I needed).
- Figure out an acceptable way to include BerkleyBackedCache (we did it for the 
Lucene db module, so why not this?).
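
For the first option, here is a rough sketch of what persisting the in-memory 
sorted map via plain Java serialization could look like.  The class name, file 
layout, and map shape are assumptions for illustration, not code from any 
attached patch.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    /** Illustrative only: persist/reload a sorted key-to-rows map with Java serialization. */
    class SerializedMapPersistence {

      /** Write the whole map to disk when the cache is flushed or closed. */
      static void write(File file, TreeMap<Object, List<Map<String, Object>>> data)
          throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
          out.writeObject(data);
        } finally {
          out.close();
        }
      }

      /** Reload the map when the cache is re-opened for a delta run or as entity input. */
      @SuppressWarnings("unchecked")
      static TreeMap<Object, List<Map<String, Object>>> read(File file)
          throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
          return (TreeMap<Object, List<Map<String, Object>>>) in.readObject();
        } finally {
          in.close();
        }
      }
    }

This would only be test-grade persistence, but it would let the 
persistence-dependent tests run without pulling in a separately licensed 
library.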


[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-04-08 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

This version fixes a bug involving the DIHCacheProcessor in the case of a 
many-to-[one|many] join between the parent entity and a child entity.  If the 
child entity used a DIHCacheProcessor and the same child joined to consecutive 
parents, only the first parent would join to the child.
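
To illustrate the failure mode (the names here are hypothetical, not the actual 
DIH classes): when consecutive parents share a child key, the child's rows have 
to be looked up fresh for each parent, rather than drained once from a shared, 
forward-only iterator.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of the corrected join pattern; not actual DIH code. */
    class CachedChildJoin {

      /** Merge a cached child's rows onto one parent row, re-querying per parent. */
      static void joinChild(Map<Object, List<Map<String, Object>>> childCache,
                            Object parentKey,
                            Map<String, Object> parentRow) {
        List<Map<String, Object>> childRows = childCache.get(parentKey);
        if (childRows == null) {
          childRows = Collections.emptyList();
        }
        for (Map<String, Object> childRow : childRows) {
          // A single iterator shared across parents would already be exhausted
          // here for the second parent with the same key; a fresh lookup is not.
          parentRow.putAll(childRow);
        }
      }
    }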

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-03-25 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Fix for two bugs in BerkleyBackedCache:

- If the passed-in fieldNames and fieldTypes have leading or trailing spaces, 
opening the cache would fail (see the sketch below).
- If the cache was set up for delta updates, then closed and re-opened, adding 
documents would cause an NPE.
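
For the first fix, the change presumably amounts to normalizing the configured 
names before they are used as keys; something along these lines (illustrative 
only, not the patch code):

    /** Illustrative only: trim whitespace from comma-separated fieldNames/fieldTypes. */
    class CacheConfigCleanup {
      static String[] splitAndTrim(String commaSeparated) {
        String[] parts = commaSeparated.split(",");
        for (int i = 0; i < parts.length; i++) {
          parts[i] = parts[i].trim();
        }
        return parts;
      }
    }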

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-03-16 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Updated patch with 2 fixes for things I missed when porting this from 1.4.1 to 
Trunk.  Also added 1 more unit test.

I think this is ready for someone else to evaluate if anyone has the time and 
desire.  I do believe this would be a nice addition to the DIH product.

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-02-28 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Updated patch with better delta-update capabilities for DIHCacheWriter.  Also 
cleaned a couple of things up.  All tests pass.

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-02-25 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

