[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2012-03-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Fix Version/s: 3.6

 DIH Cache Improvements
 --

 Key: SOLR-2382
 URL: https://issues.apache.org/jira/browse/SOLR-2382
 Project: Solr
  Issue Type: New Feature
  Components: contrib - DataImportHandler
Reporter: James Dyer
Assignee: James Dyer
Priority: Minor
 Fix For: 3.6, 4.0

 Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
 SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
 SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
 SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
 SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
 TestCachedSqlEntityProcessor.java-break-where-clause.patch, 
 TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
  
 TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch,
  TestThreaded.java.patch


 Functionality:
  1. Provide a pluggable caching framework for DIH so that users can choose a 
 cache implementation that best suits their data and application.
  
  2. Provide a means to temporarily cache a child Entity's data without 
 needing to create a special cached implementation of the Entity Processor 
 (such as CachedSqlEntityProcessor).
  
  3. Provide a means to write the final (root entity) DIH output to a cache 
 rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
 cache as an Entity input.  Also provide the ability to do delta updates on 
 such persistent caches.
  
  4. Provide the ability to partition data across multiple caches that can 
 then be fed back into DIH and indexed either to varying Solr Shards, or to 
 the same Core in parallel.
 Use Cases:
 1. We needed a flexible & scalable way to temporarily cache child-entity 
 data prior to joining to parent entities.
   - Using SqlEntityProcessor with Child Entities can cause an n+1 select 
 problem.
   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
 mechanism and does not scale.
   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
  
 2. We needed the ability to gather data from long-running entities by a 
 process that runs separately from our main indexing process.
   
  3. We wanted the ability to do a delta import of only the entities that 
 changed.
   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
 few fields changed.
   - Our data comes from 50+ complex sql queries and/or flat files.
   - We do not want to incur overhead re-gathering all of this data if only 1 
 entity's data changed.
   - Persistent DIH caches solve this problem.
   
  4. We want the ability to index several documents in parallel (using 1.4.1, 
 which did not have the threads parameter).
  
  5. In the future, we may need to use Shards, creating a need to easily 
 partition our source data into Shards.
 Implementation Details:
  1. De-couple EntityProcessorBase from caching.  
   - Created a new interface, DIHCache, & two implementations (a rough interface sketch follows after this list):  
 - SortedMapBackedCache - An in-memory cache, used as default with 
 CachedSqlEntityProcessor (now deprecated).
 - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
 with je-4.1.6.jar
- NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar.  
 I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, 
 so to use or evaluate this patch, download bdb-je from 
 http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
  
  2. Allow Entity Processors to take a cacheImpl parameter to cause the 
 entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
  
  3. Partially De-couple SolrWriter from DocBuilder
   - Created a new interface, DIHWriter, & two implementations:
- SolrWriter (refactored)
- DIHCacheWriter (allows DIH to write ultimately to a Cache).

  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
 persistent Cache as DIH Entity Input.
  
  5. Support a partition parameter with both DIHCacheWriter and 
 DIHCacheProcessor to allow for easy partitioning of source entity data.
  
  6. Change the semantics of entity.destroy()
   - Previously, it was being called on each iteration of DocBuilder.buildDocument().
   - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed.
   - The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.
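
To make Implementation Detail 1 a little more concrete for readers of this thread, here is a rough Java sketch of the kind of pluggable cache interface being described. It is an illustration only: the method set and signatures below are my assumptions about such an interface's shape, not necessarily what the attached patches define (the DIH Context type is assumed for initialization).

{code:title=Illustrative sketch only - not necessarily the patch's actual interface}
import java.util.Iterator;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;

/** Rows are Map<String,Object> keyed by a primary-key value, as DIH already models entity rows. */
public interface DIHCache extends Iterable<Map<String, Object>> {
  void open(Context context);                          // initialize, e.g. open a disk-backed store
  void add(Map<String, Object> row);                   // append one row under its key
  Iterator<Map<String, Object>> iterator(Object key);  // all rows previously stored for a key
  void delete(Object key);                             // delta updates: drop one key's rows
  void deleteAll();                                    // clear the whole cache
  void flush();                                        // persist pending writes
  void close();                                        // release resources but keep the data
  void destroy();                                      // one-time cleanup, delete backing data
}
{code}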

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2012-03-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382_3x.patch

The patch for 3.x includes everything already committed to Trunk as well as various 
bug fixes (also in Trunk already).  (The patch is here for reference only; the changes 
were actually moved using svn merge.)

I will commit to 3.x shortly.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-29 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: 
TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch

TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch
 breaks withKeyAndLookup() by reordering the entries in the row map with a LinkedHashMap 
(please have a look at createLinkedMap()).

The stacktrace is pretty much the same as in the comment above. You can check the 
failed instance of SortedMapBackedCache in a debugger: it has desc as the 
primaryKeyName and a messed-up theMap:
{code} {another another three=[{id=3, desc=another another three}], another 
three=[{id=3, desc=another three}], another two=[{id=2, desc=another two}], 
one=[{desc=one, id=1}], three=[{id=3, desc=three}], two=[{id=2, desc=two}]}
{code}
Regards
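
(For readers following the thread: the ordering dependence Mikhail describes is easy to reproduce outside DIH. The snippet below is illustrative only and is not DIH code; it just shows how a "first key in the row map" fallback picks a different primary-key column once the same row is wrapped in a sorted map.)

{code:title=Illustrative only - why a "first key wins" fallback breaks under a sorted map}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FirstKeyPitfall {
  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 3);                      // insertion order: id comes first
    row.put("desc", "another three");

    Map<String, Object> sortedRow = new TreeMap<>(row);  // alphabetical order: desc comes first

    // A fallback that treats "the first key" as the primary key gets different answers
    // depending on which Map implementation the row happens to be stored in.
    System.out.println(row.keySet().iterator().next());        // prints: id
    System.out.println(sortedRow.keySet().iterator().next());  // prints: desc
  }
}
{code}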

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is an updated dihwriter patch, fixing a test bug.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter_standalone.patch

{quote}
Can you achieve your objective by just adding an implementation? I mean will 
you need to patch DIH to meet your needs?
{quote}
This patch, SOLR-2382-dihwriter_standalone.patch, separates the parameter 
names so that it is entirely stand-alone.  So yes, for my purposes, we can put 
this code in a separate .jar and it will not need any patches to DIH.

{quote}
My question is, is it relevant to a large number of people to have this last patch?
{quote}
Anyone who is indexing a lot of data that needs to join from multiple sources 
would probably consider using this, provided it gets well-documented.  In our 
case, we have different data sources across our enterprise that all contribute 
a few fields to each Solr document, so having this added flexibility was vital 
for us.

In any case, if you do not want to work towards committing this last patch now 
I think it would be wise to close SOLR-2382 and I can open a new case just for 
this last patch.  That'll make it easier for someone who wants the 
functionality in the future to find it in JIRA, or if some other committer 
wants to pick it up someday.  Let me know what you plan to do.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-28 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch
TestCachedSqlEntityProcessor.java-break-where-clause.patch

James,

please find my proof of the absence of where=xid=x.id support: 
TestCachedSqlEntityProcessor.java-break-where-clause.patch. It looks puzzling - 
I'm sorry for that. The test was green only because it relied on the key order in 
the map; wrapping the map in a sorted map breaks that order and leads to picking 
the wrong primary-key column. Please find the explanation below.

From my point of view, the cruelest part is 
[lines 27-28|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]:
 it picks up just the first key of the map as the primary key when one wasn't 
properly detected from the attributes. This fallback hides the problem until you 
face it and have to address it.

The left part of the where clause isn't used [here at lines 
45-48|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup]
 and where= is ignored again [at lines 
185-190|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup].

You can see that the second attachment, 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
 fixes the test by adding cachePk and lookup to the attributes.

My proposals are:
* Fix it; it's not a big deal to bring the where attribute back.
* But are the new cachePk and cacheLookup attributes really better than the old where 
attribute? Depending on the answer, I vote either for
** decommissioning where=, or for 
** rolling the new cachePk/cacheLookup attributes back.
* Can't we add more randomization to 
AbstractDataImportHandlerTestCase.createMap(Object...) to find more similar 
hidden issues? I propose using a concrete map behaviour at random: hash, sorted, 
sorted-reverse (a quick sketch follows below). WDYT?
* The names withWhereClause() and withKeyAndLookup() should be swapped; their 
contents contradict [the 
names|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?view=markup]:
{code}
  public void withWhereClause() {
    ...
    query, q, DIHCacheSupport.CACHE_PRIMARY_KEY, id,
    DIHCacheSupport.CACHE_FOREIGN_KEY,
    ...
  public void withKeyAndLookup() {
    ...
    Map<String, String> entityAttrs = createMap(query, q, where, id=x.id,
    ...
{code}
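
(A minimal sketch of the randomization idea from the proposal above. The helper name mirrors AbstractDataImportHandlerTestCase.createMap(Object...), but this standalone version is only my assumption of how such randomization could look, not existing test code.)

{code:title=Sketch of a randomized createMap(Object...) helper - assumption, not existing code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class RandomizedMaps {
  private static final Random RANDOM = new Random();

  /** Builds a map from key/value pairs, randomly varying the backing implementation. */
  public static Map<String, Object> createMap(Object... args) {
    Map<String, Object> map;
    switch (RANDOM.nextInt(3)) {
      case 0:  map = new HashMap<>(); break;                            // hash order
      case 1:  map = new TreeMap<>(); break;                            // sorted
      default: map = new TreeMap<>(Collections.reverseOrder()); break;  // sorted-reverse
    }
    for (int i = 0; i < args.length; i += 2) {
      map.put((String) args[i], args[i + 1]);
    }
    return map;
  }
}
{code}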


[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-27 Thread Mikhail Khludnev (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev updated SOLR-2382:
---

Attachment: TestThreaded.java.patch

Hello,

I want to contribute a test for the parent/child use case with 
CachedSqlEntityProcessor in single- and multi-threaded modes: 
TestThreaded.java.patch, against r1144761.
Please let me know how you feel about it.

Actually, I explored this case on 3.4 some 
time ago, but decided to wait a little until this refactoring made some progress.

* The first issue is the testCachedSingleThread_FullImport() failure. It's caused 
by 
{code:title=DocBuilder.java line 473}
   } finally {
entityProcessor.destroy();
  }
{code}
This code, which cleans up the cache, makes sense for parent entities 
only; it causes a failure in the child-entity enumeration when run() is 
called from line 510. It shouldn't be a big deal to fix (a toy illustration 
appears at the end of this comment).

* Then, some minor moaning: it looks like where=xid=x.id is not supported by the new 
code, which relies on cachePk=xid and cacheLookup=x.id. For me it's a 
matter of opinion, backward compatibility, and documentation. 

* The most interesting problem is the failure of 
testCachedMultiThread_FullImport(). On 3.4 it's caused by shared child-entity 
iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache 
is accessed by multiple threads from ThreadedEntityProcessorWrapper. I have 
some ideas, but I want to know your opinion.

Guys, I can handle some of these; let me know how I can help.

--
Mikhail
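
(An editorial aside on the first point above: the effect is easy to see in miniature. The toy below is not DIH code and all names in it are made up; it just shows why clearing a shared child cache inside the per-parent loop, as the quoted finally block effectively does, leaves nothing for the second parent row, while deferring the cleanup until after the loop matches the one-time destroy() semantics described in Implementation Detail 6 of this issue.)

{code:title=Toy illustration only - not DIH code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DestroyTooEarly {
  // Stands in for a cached child entity, keyed by the parent's foreign key.
  static Map<String, List<String>> childCache = new HashMap<>();

  static void destroy() { childCache.clear(); }  // analogous to entityProcessor.destroy()

  public static void main(String[] args) {
    childCache.put("p1", Arrays.asList("c1", "c2"));
    childCache.put("p2", Arrays.asList("c3"));

    for (String parent : Arrays.asList("p1", "p2")) {
      try {
        System.out.println(parent + " -> " + childCache.get(parent));
      } finally {
        destroy();  // per-iteration cleanup: the second parent prints "p2 -> null"
      }
    }
    // Calling destroy() here instead (once, after all parents are processed)
    // is the one-time-cleanup behaviour the refactoring aims for.
  }
}
{code}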

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-11-14 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is an updated version of the dihwriter patch, which is the last part of 
this ticket.  This is in sync with Noble's commit of the entities patch.

This includes:
1. DIHCacheWriter - lets DIH write to a DIHCache instead of to Solr for future 
processing.  This also supports Delta Updates on caches.
2. DIHCacheProcessor - lets DIH read from a cache that was previously  written 
by DIHCacheWriter.
3. MockDIHCache - a bare-bones cache implementation that supports persistence 
and delta updates.  This allows us to run all of the unit tests without also 
needing to apply SOLR-2613 (BerkleyBackedCache).  This also provides a full 
reference implementation for those who would like to write their own Caches.

Noble, will you be able to work through this patch also with me?
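
(For readers who want to picture what a bare-bones cache along the lines of item 3 might look like, here is a standalone toy sketch. It is my own illustration: the class and method names are assumptions, it implements no actual DIH interface, and it is not the MockDIHCache from the patch. It simply stores rows keyed by a primary-key column, with per-key lookup and delete to support delta-style updates.)

{code:title=Standalone toy sketch - not the MockDIHCache from the patch}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyRowCache {
  // key value -> all rows sharing that key; keys are assumed Comparable (e.g. String or Integer)
  private final TreeMap<Object, List<Map<String, Object>>> rowsByKey = new TreeMap<>();
  private final String primaryKeyName;

  public ToyRowCache(String primaryKeyName) {
    this.primaryKeyName = primaryKeyName;
  }

  public void add(Map<String, Object> row) {
    Object key = row.get(primaryKeyName);
    rowsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  public Iterator<Map<String, Object>> lookup(Object key) {
    return rowsByKey.getOrDefault(key, Collections.emptyList()).iterator();
  }

  public void delete(Object key) {   // delta updates: drop (or later replace) one key's rows
    rowsByKey.remove(key);
  }

  public void deleteAll() {
    rowsByKey.clear();
  }
}
{code}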

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-27 Thread Noble Paul (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

With some cleanup. 
I think there is a very big omission: the EntityProcessorBase.transformers 
field is not used in the latest patch. How does transformation work?

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

Here is a version of the dihwriter patch compatible with today's entity 
patch update.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-21 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Re-attaching the entities patch with the grant-license button selected.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-10-17 Thread James Dyer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Noble,

Here is a version of the entities patch using .iterator() methods as you 
suggest.  Let me know if this is what you had in mind and also if there is 
anything else you'd like to address.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-08-05 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch
SOLR-2382-entities.patch

Updating the patches to the current Trunk.  To use these, apply entities 
first, then dihwriter.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-29 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter-verbose-fix.patch

This patch fixes verbose debugging output and can be applied to the current 
Trunk.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

Here is a freshly-sync'ed version of the entities patch.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-21 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

Here is the solrwriter patch, sync'ed to the latest trunk, which now has the 
first patch (properties) committed...

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-14 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-dihwriter.patch

This is the DIHCacheWriter and DIHCacheProcessor portion taken from the 
June 24 version of the entire patch. This patch depends both on the 
properties patch and the solrwriter patch.

DIHCacheWriter is a drop-in replacement for SolrWriter, allowing a DIH run to 
write its output to a DIHCache rather than to Solr.  Users can specify which 
cacheImpl to use.  If the cache supports persistence, delta updates can be 
performed on the cached data.

DIHCacheProcessor is an EntityProcessor that takes a DIHCache as its input.

Unit tests are included.  However, most tests are skipped because 
SortedMapBackedCache does not support persistence.  To get the tests to run 
(for now), we also need to apply SOLR-2613 (BerkleyBackedCache).
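
To make the write-then-read idea concrete, here is a purely illustrative sketch 
(class and method names are invented, and the in-memory list stands in for a 
persistent store) of the round trip: one DIH run writes its finished rows to a 
cache instead of Solr, and a later run reads them back as entity input.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the round trip this patch enables.
class CacheRoundTripSketch {
  // Stand-in for a persistent cache; a real one would live on disk.
  private final List<Map<String, Object>> store = new ArrayList<>();

  // Writer side (what DIHCacheWriter does conceptually): accept finished documents.
  void write(Map<String, Object> document) {
    store.add(document);
  }

  // Reader side (what DIHCacheProcessor does conceptually): hand the same rows
  // back to DIH one at a time, so they can be indexed later or in parallel.
  Iterator<Map<String, Object>> readBack() {
    return store.iterator();
  }
}
{code}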

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-13 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

The patch does not apply.  The codebase has changed in the trunk, but it still 
does not apply.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-13 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-entities.patch

This is the Entity Processor Caching portion taken from the June 24 version of 
the entire patch.  This patch depends both on the properties patch and the 
solrwriter patch.

This gives users the ability to specify a cacheImpl parameter to any Entity 
Processor, adding caching functionality.  CachedSqlEntityProcessor is gutted 
and deprecated (with no loss of functionality).  Instead, users can now use 
SqlEntityProcessor and specify a cacheImpl.  Also, the caching functionality 
has been removed from EntityProcessorBase and placed into SortedMapBackedCache, 
which is the only Cache implementation available here (BerkleyBackedCache has 
been moved to SOLR-2613).

Two new unit tests are included in this patch, one for SortedMapBackedCache and 
another demonstrating the use of an ephemeral cache to do joins.
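
For orientation, here is a stripped-down sketch (not the committed 
SortedMapBackedCache class) of what a sorted-map-backed cache amounts to: rows 
grouped under a key column, so child-entity joins can be served from memory 
instead of re-querying per parent row (avoiding the n+1 select problem).

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of an in-memory, sorted-map-backed row cache.
class SortedMapCacheSketch {
  private final String keyColumn;
  // Keys are assumed Comparable (a TreeMap requirement); typical DIH keys are.
  private final SortedMap<Object, List<Map<String, Object>>> data = new TreeMap<>();

  SortedMapCacheSketch(String keyColumn) {
    this.keyColumn = keyColumn;
  }

  // Store one row under its key-column value.
  void add(Map<String, Object> row) {
    Object key = row.get(keyColumn);
    data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
  }

  // All cached rows whose key column equals the given value.
  Iterator<Map<String, Object>> lookup(Object key) {
    List<Map<String, Object>> rows = data.get(key);
    return rows == null ? Collections.<Map<String, Object>>emptyIterator() : rows.iterator();
  }
}
{code}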

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-properties.patch

This is the properties file writer abstraction taken from the June 24 version 
of the entire patch.

This removes property file operations from SolrWriter, paving the way for 
multiple DIHWriter implementations (such as DIHCacheWriter).
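
A minimal sketch of that separation, with invented names (the patch's actual 
classes and signatures may differ): the import-run bookkeeping, e.g. 
last_index_time, sits behind its own small interface so that any DIHWriter 
implementation can reuse it rather than SolrWriter owning the property file.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: persistence of DIH bookkeeping pulled out of the writer.
interface DIHPropertiesSketch {
  Map<String, Object> read();              // load previously persisted properties
  void persist(Map<String, Object> props); // write them back after a successful run
}

// A trivial in-memory implementation for illustration; a real one would read
// and write dataimport.properties in the conf directory.
class InMemoryPropertiesSketch implements DIHPropertiesSketch {
  private Map<String, Object> stored = new HashMap<>();

  @Override
  public Map<String, Object> read() {
    return new HashMap<>(stored);
  }

  @Override
  public void persist(Map<String, Object> props) {
    stored = new HashMap<>(props);
  }
}
{code}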

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-properties.patch

The previous patch left out a couple of things.  Here is a fixed version...

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-07-12 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382-solrwriter.patch

This is the DIHWriter abstraction taken from the June 24 version of the 
entire patch.  This patch depends on SOLR-2382-properties.patch.

SolrWriter now implements the new DIHWriter interface.  Also, logging 
operations are removed from SolrWriter.  This opens the possibility of creating 
pluggable DIHWriters so that DIH can write to something other than a Solr index. 
 This in turn opens up the ability to create a Writer that can write to a 
Persistent Cache for later use.
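
The writer seam described above might look roughly like this; the method names 
below are assumptions, not the actual DIHWriter signatures from the patch.

{code:java}
import java.util.Map;

// Sketch: the writer-side seam between DocBuilder and its output target.
// DocBuilder talks only to this interface; one implementation sends documents
// to Solr, another appends the same rows to a persistent cache.
interface DIHWriterSketch {
  void init(Map<String, Object> requestParams); // per-run setup (hypothetical signature)
  boolean upload(Map<String, Object> document); // write one finished document/row
  void deleteByKey(Object key);                 // delta support: drop stale entries
  void commit(boolean optimize);                // finish the run
  void rollback();                              // abandon the run
  void close();                                 // release resources
}
{code}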

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-24 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Here is a version that passes parameters via the Context object, rather than by 
building maps.
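
A small sketch of what passing parameters via the Context looks like in 
practice; Context.getResolvedEntityAttribute() is an existing DIH method, and 
cacheImpl is the entity parameter this issue introduces.

{code:java}
import org.apache.solr.handler.dataimport.Context;

// Sketch: reading per-entity configuration straight off the Context instead of
// assembling and passing around separate parameter maps.
class CacheConfigReaderSketch {

  // Returns the cache implementation configured on the entity,
  // e.g. cacheImpl="SortedMapBackedCache" in data-config.xml, or null if unset.
  static String resolveCacheImpl(Context context) {
    return context.getResolvedEntityAttribute("cacheImpl");
  }
}
{code}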

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Just found a little bug in SortedMapBackedCache.  This patch version includes a 
fix for it.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Sorry...that last patch included some unrelated code.  This one is correct.

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-22 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: (was: SOLR-2382.patch)

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-06-21 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Here's a new patch:

- Applies cleanly to latest trunk.
- Better de-coupling of SolrWriter and the PropertiesWriter
- BerkleyBackedCache removed (will open a separate issue for this).
- Unit test enhancements to make it easy to unit test new cache implementations 
as they get added.

Note that because SortedMapBackedCache does not support persistence, most of 
the features are untestable (one test was removed...The others just skip for 
now).  Three possible solutions:

- create a persistence option for SortedMapBackedCache (maybe just use Java 
serialization; a rough sketch follows this list).
- create a new cache impl that doesn't depend on an incompatibly-licensed 
product (maybe Lucene-backed, although I tried this long ago and didn't get the 
performance I needed).
- Figure out an acceptable way to include BerkleyBackedCache (we did it for the 
Lucene db module, so why not this?).
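
For the first option, here is a rough sketch of what persisting the in-memory 
sorted map via plain Java serialization could look like.  The class name, file 
layout, and map shape are assumptions for illustration, not code from any 
attached patch.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    /** Illustrative only: persist/reload a sorted key-to-rows map with Java serialization. */
    class SerializedMapPersistence {

      /** Write the whole map to disk when the cache is flushed or closed. */
      static void write(File file, TreeMap<Object, List<Map<String, Object>>> data)
          throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
          out.writeObject(data);
        } finally {
          out.close();
        }
      }

      /** Reload the map when the cache is re-opened for a delta run or as entity input. */
      @SuppressWarnings("unchecked")
      static TreeMap<Object, List<Map<String, Object>>> read(File file)
          throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
          return (TreeMap<Object, List<Map<String, Object>>>) in.readObject();
        } finally {
          in.close();
        }
      }
    }

This would only be test-grade persistence, but it would let the 
persistence-dependent tests run without pulling in a separately licensed 
library.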


[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-04-08 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

This version fixes a bug involving the DIHCacheProcessor in the case of a 
many-to-[one|many] join between the parent entity and a child entity.  If the 
child entity used a DIHCacheProcessor and the same child joined to consecutive 
parents, only the first parent would join to the child.
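
To illustrate the failure mode (the names here are hypothetical, not the actual 
DIH classes): when consecutive parents share a child key, the child's rows have 
to be looked up fresh for each parent, rather than drained once from a shared, 
forward-only iterator.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of the corrected join pattern; not actual DIH code. */
    class CachedChildJoin {

      /** Merge a cached child's rows onto one parent row, re-querying per parent. */
      static void joinChild(Map<Object, List<Map<String, Object>>> childCache,
                            Object parentKey,
                            Map<String, Object> parentRow) {
        List<Map<String, Object>> childRows = childCache.get(parentKey);
        if (childRows == null) {
          childRows = Collections.emptyList();
        }
        for (Map<String, Object> childRow : childRows) {
          // A single iterator shared across parents would already be exhausted
          // here for the second parent with the same key; a fresh lookup is not.
          parentRow.putAll(childRow);
        }
      }
    }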

[jira] [Updated] (SOLR-2382) DIH Cache Improvements

2011-03-25 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Fix for two bugs in BerkleyBackedCache:

- If the passed-in fieldNames and fieldTypes have leading or trailing spaces, 
opening the cache would fail (see the sketch below).
- If the cache was set up for delta updates, then closed and re-opened, adding 
documents would cause an NPE.
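
For the first fix, the change presumably amounts to normalizing the configured 
names before they are used as keys; something along these lines (illustrative 
only, not the patch code):

    /** Illustrative only: trim whitespace from comma-separated fieldNames/fieldTypes. */
    class CacheConfigCleanup {
      static String[] splitAndTrim(String commaSeparated) {
        String[] parts = commaSeparated.split(",");
        for (int i = 0; i < parts.length; i++) {
          parts[i] = parts[i].trim();
        }
        return parts;
      }
    }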

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-03-16 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Updated patch with 2 fixes for things I missed when porting this from 1.4.1 to 
Trunk.  Also added 1 more unit test.

I think this is ready for someone else to evaluate if anyone has the time and 
desire.  I do believe this would be a nice addition to the DIH product.

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-02-28 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

Updated patch with better delta-update capabilities for DIHCacheWriter.  Also 
cleaned a couple of things up.  All tests pass.

[jira] Updated: (SOLR-2382) DIH Cache Improvements

2011-02-25 Thread James Dyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Dyer updated SOLR-2382:
-

Attachment: SOLR-2382.patch

