[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158689#comment-13158689 ] Mikhail Khludnev edited comment on SOLR-2382 at 11/28/11 7:52 PM:
--
James, please find my proof that where="xid=x.id" is not supported: TestCachedSqlEntityProcessor.java-break-where-clause.patch. It may look puzzling; I'm sorry for that. The test was green only because it relied on the key order of the map. Wrapping the map in a sorted one breaks that order and leads to picking up the wrong primary-key column. Please find the explanation below.

From my point of view, the worst part is [lines 27-28|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]: when the primary key wasn't properly detected from the attributes, it just picks up the first key of the map as the primary key, so this condition hides the problem until you face it. The left part of the where clause isn't used [here at lines 45-48|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup], and where="" is ignored again [at lines 185-190|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]. You can see that the second attachment, TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch, fixes the test by adding cachePk and cacheLookup to the attributes.

My proposals are:
* fix it; it's not a big deal to bring the where attribute back
* but why are the new attributes cachePk and cacheLookup better than the old where attribute? Depending on the answer, I vote either for
** decommissioning where="", or for
** rolling the new cachePk/cacheLookup attributes back
* can't we add more randomization to AbstractDataImportHandlerTestCase.createMap(Object...) to find more similar hidden issues? I propose to choose the concrete map behaviour randomly: hash, sorted, sorted-reverse. WDYT?
* the names withWhereClause() and withKeyAndLookup() should be swapped; their content contradicts [the names|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?view=markup]
{code}
public void withWhereClause() {
  ... "query", q, DIHCacheSupport.CACHE_PRIMARY_KEY, "id", DIHCacheSupport.CACHE_FOREIGN_KEY, " ...

public void withKeyAndLookup() {
  ... Map entityAttrs = createMap("query", q, "where", "id=x.id", ...
{code}
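The ordering pitfall described above can be reproduced independently of DIH: the "first key" of a map depends entirely on the backing implementation, so a test that implicitly relies on insertion order goes green with one map and red with another. A minimal illustration (plain Java, not the actual DIH classes; names are made up for the sketch):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FirstKeyPitfall {

    // Mimics SortedMapBackedCache's fallback: if no primary key was
    // configured, take whatever key the map yields first.
    public static String guessPrimaryKey(Map<String, Object> row) {
        return row.keySet().iterator().next();
    }

    private static Map<String, Object> sampleRow() {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("xid", 1);        // the intended primary key, inserted first
        row.put("desc", "a");
        return row;
    }

    // An insertion-ordered map happens to give the "right" answer...
    public static String firstKeyInsertionOrder() {
        return guessPrimaryKey(sampleRow());
    }

    // ...but a sorted wrapper silently changes the guess to "desc".
    public static String firstKeySortedOrder() {
        return guessPrimaryKey(new TreeMap<>(sampleRow()));
    }

    public static void main(String[] args) {
        System.out.println(firstKeyInsertionOrder()); // xid
        System.out.println(firstKeySortedOrder());    // desc
    }
}
```

This is exactly why randomizing the map behaviour in createMap(Object...) would surface such hidden order dependencies.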
[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158623#comment-13158623 ] Mikhail Khludnev edited comment on SOLR-2382 at 11/28/11 6:47 PM:
--
James, please find my replies below.
{quote} Currently none of the existing unit tests fail with this change and I'm not sure exactly how to reproduce the problem you're describing. Could you create a failing unit test for this to clarify what you're experiencing? {quote}
I did that yesterday: I attached TestThreaded.java.patch against trunk r1144761. Can you see it in the attachments? After you apply it, both new tests fail. The simplest one, testCachedSingleThread_FullImport(), can be fixed by disabling cache destruction, i.e. commenting out DocBuilder.java line 473: // entityProcessor.destroy();. I believe this should be properly covered by a separate issue.
{quote} Once again, a failing unit test would be helpful in knowing how to reproduce the specific problem you've found. {quote}
testCachedSingleThread_FullImport() fails again if you remove cachePk="xid" cacheLookup="x.id" from the test config constant at line 109. I briefly looked into TestCachedSQLEntityProcessor; it seems to me I've found where it is not accurate enough. Let me attach my findings soon.
{quote} But if its happening in 3.4 then its not related to SOLR-2382, which is in Trunk/4.0 only {quote}
The problem is the same as on 3.x. My TestThreaded.java.patch has testCached_*Multi*_Thread_FullImport(), which covers exactly this race condition. You can see it after making testCached_*Single*_Thread_FullImport() green somehow (see issue no. 1). I can't spawn it as a new issue because of blocker no. 1.

> DIH Cache Improvements
> ----------------------
>
> Key: SOLR-2382
> URL: https://issues.apache.org/jira/browse/SOLR-2382
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Reporter: James Dyer
> Priority: Minor
> Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-properties.patch, SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, TestThreaded.java.patch
>
> Functionality:
> 1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
> 2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).
> 3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches.
> 4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel.
> Use Cases:
> 1. We needed a flexible & scalable way to temporarily cache child-entity data prior to joining to parent entities.
> - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" problem.
> - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale.
> - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
> 2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
> 3. We wanted the ability to do a delta import of only the entities that changed.
> - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
> - Our data comes from 50+ complex sql queries and/or flat files.
> - We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
> - Persistent DIH caches solve this problem.
> 4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the "threads" parameter).
> 5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards.
> Implementation Details:
> 1. De-couple EntityProcessorBase from caching.
> - Created a new interface, DIHCache & two implementations:
> - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
> - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
> - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage.
> - NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
> 2. Allow Entity Processors to take a "cacheImpl" parameter to cause the entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
> 3. Partially De-couple SolrWriter from DocBuilder
> - Created a new interface DIHWriter, & two implementations:
> - SolrWriter (refactored)
> - DIHCacheWriter (allows DIH to write ultimately to a Cache).
> 4. Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input.
> 5. Support a "partition" parameter with both DIHCacheWriter and DIHCacheProcessor.
[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157979#comment-13157979 ] Mikhail Khludnev edited comment on SOLR-2382 at 11/27/11 7:47 PM:
--
Hello, I want to contribute a test for the parent/child use case with CachedSqlEntityProcessor in single- and multi-threaded modes: TestThreaded.java.patch on r1144761. Please let me know how you feel about it. Actually, I explored this case on 3.4 some time ago, but decided to wait a little until this refactoring made progress.
* The first issue is the testCachedSingleThread_FullImport() failure. It's caused by
{code:title=DocBuilder.java line 473}
} finally {
  entityProcessor.destroy();
}
{code}
This code, which cleans up the cache, makes sense for parent entities only; it causes a failure in the child entities' enumeration when run() is called from line 510. It shouldn't be a big deal to fix.
* Then, some minor moaning: it looks like where="xid=x.id" is not supported by the new code, which relies on cachePk="xid" and cacheLookup="x.id". For me it's a matter of opinion, backward compatibility and documentation.
* The most interesting problem is the failure of testCachedMultiThread_FullImport(). On 3.4 it's caused by concurrent access to the child entities' iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache is accessed by multiple threads from ThreadedEntityProcessorWrapper.
I have some ideas, but want to know your opinion. Guys, I can handle some of these; let me know how I can help.
-- Mikhail
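The destroy() problem above suggests a guard rather than an unconditional finally-block call. A hypothetical sketch of that idea (not the actual DocBuilder code; Entity, EntityProcessor, and destroyIfRoot are made-up stand-ins): destroy the processor's cache only for a root entity, so cached child entities survive the repeated run() calls that enumerate them.

```java
// Hypothetical sketch: cache destruction restricted to root entities.
public class DestroySketch {

    static class Entity {
        final Entity parent;                 // null for a root entity
        Entity(Entity parent) { this.parent = parent; }
        boolean isRoot() { return parent == null; }
    }

    interface EntityProcessor { void destroy(); }

    // Returns true when destroy() was actually invoked; the guard
    // replaces the unconditional call in the finally block.
    public static boolean destroyIfRoot(Entity entity, EntityProcessor processor) {
        if (entity.isRoot()) {
            processor.destroy();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Entity root = new Entity(null);
        Entity child = new Entity(root);
        EntityProcessor noop = () -> {};
        System.out.println(destroyIfRoot(root, noop));  // true
        System.out.println(destroyIfRoot(child, noop)); // false
    }
}
```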
[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125693#comment-13125693 ] Noble Paul edited comment on SOLR-2382 at 10/12/11 9:54 AM:
The DIHCache interface should not have the following methods:
{code:java}
/**
 * Get the next document in the cache, or NULL if the last record has been reached.
 * Use this method to efficiently iterate through the entire cache.
 */
public Map<String, Object> getNext();

/**
 * Reset the cache's internal iterator so that a subsequent call to getNext()
 * will return the first cached row.
 */
public void resetNext();
{code}
Instead we should have the methods:
{code:java}
Iterator<Map<String, Object>> getIterator()
Iterator<Map<String, Object>> getIterator(String key)
{code}
This means all the state maintained for iteration will go away. If we can do that, the following fields become redundant in SortedMapBackedCache:
{code:java}
private Iterator<Map.Entry<Object, List<Map<String, Object>>>> theMapIter = null;
private Object currentKey = null;
private List<Map<String, Object>> currentKeyResult = null;
private Iterator<Map<String, Object>> currentKeyResultIter = null;
{code}
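The iterator-based API proposed above can be sketched as an interface plus a trivial in-memory implementation (hypothetical names; the real SortedMapBackedCache keys rows by an Object key and differs in detail). The point is that each caller gets its own Iterator, so the cache itself holds no iteration state:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the stateless, iterator-based cache API.
interface IterableDIHCache {
    Iterator<Map<String, Object>> getIterator();           // all rows
    Iterator<Map<String, Object>> getIterator(String key); // rows for one key
}

class MapBackedCacheSketch implements IterableDIHCache {
    private final SortedMap<String, List<Map<String, Object>>> data = new TreeMap<>();

    void add(String key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    @Override
    public Iterator<Map<String, Object>> getIterator() {
        // Flatten all per-key row lists into a single stream of rows;
        // no theMapIter/currentKey fields are needed.
        return data.values().stream().flatMap(List::stream).iterator();
    }

    @Override
    public Iterator<Map<String, Object>> getIterator(String key) {
        return data.getOrDefault(key, Collections.emptyList()).iterator();
    }
}
```

Two threads can then iterate the same cache concurrently without stepping on each other's position, which also bears on the multi-thread failures discussed earlier in this thread.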
[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125376#comment-13125376 ] Pulkit Singhal edited comment on SOLR-2382 at 10/11/11 8:49 PM:
James, I'm trying to figure out how the data-config.xml would be written to support this case:
{quote}
We needed a flexible & scalable way to temporarily cache child-entity data prior to joining to parent entities.
{quote}
1) Can you please provide a sample?
2) Also, my guess is that testing this feature would require applying both entities.patch and dihwriter.patch ... correct? I am guessing so because I think dihwriter.patch has the code changes for data to be written to and consumed from a cache. In my use case there is one huge fixed-position child-entity text file which has:
{code}
parentDataID_1 someData1
parentDataID_1 someData2
parentDataID_2 someData3
parentDataID_2 someData4
{code}
So I just want to make sure I'm getting the right set of patches and understanding the application of the use case properly. Please advise.
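A hedged sketch of what such a data-config.xml might look like with the entities.patch applied, using the cacheImpl/cachePk/cacheLookup attributes discussed in this thread. The file path, field names, parent query, and the LineEntityProcessor/RegexTransformer wiring for splitting the fixed-position line are all assumptions for illustration, not a confirmed recipe:

```xml
<dataConfig>
  <dataSource name="db" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:mem:test"/>
  <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="parent" dataSource="db" query="SELECT id, name FROM parent" pk="id">
      <!-- Child rows come from one large fixed-position text file.
           cacheImpl caches them once instead of re-reading per parent;
           cachePk/cacheLookup express the join the old where="..." did. -->
      <entity name="child"
              processor="LineEntityProcessor"
              dataSource="fileReader"
              url="/data/child-data.txt"
              transformer="RegexTransformer"
              cacheImpl="SortedMapBackedCache"
              cachePk="parentDataID"
              cacheLookup="parent.id">
        <field column="rawLine" regex="^(\S+)\s+(\S+)$" groupNames="parentDataID,someData"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```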
[jira] [Issue Comment Edited] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054263#comment-13054263 ] Noble Paul edited comment on SOLR-2382 at 6/24/11 6:42 AM:
---
bq. cacheInit() in EntityProcessorBase specifically passes only the parameters that apply to the current situation
It doesn't matter. The cache can use any params that are relevant to it; anyway, you can't define in advance which params a future DIHCache impl will require. Look at a Transformer implementation: it can read anything it wants. The cache should be initialized the same way. Why should DocBuilder even be aware of DIHCache? Shouldn't it be kept local to the EntityProcessor?
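Noble Paul's point, that a cache, like a Transformer, should pull whatever parameters it needs from the attribute map rather than have DocBuilder enumerate them, could be sketched like this (hypothetical interface and attribute names; the real DIH Context API differs):

```java
import java.util.Map;

// Sketch of "Transformer-style" initialization: the framework hands the
// cache the full attribute map, and each implementation reads only the
// entries relevant to it. DocBuilder never needs to know which ones.
interface SelfInitializingCache {
    void init(Map<String, String> entityAttributes);
}

class DiskCacheSketch implements SelfInitializingCache {
    String baseLocation;
    int sizeMb;

    @Override
    public void init(Map<String, String> attrs) {
        // Read only what this implementation cares about; a future
        // implementation can read entirely different attributes.
        baseLocation = attrs.getOrDefault("cacheBaseDirectory", "/tmp");
        sizeMb = Integer.parseInt(attrs.getOrDefault("cacheSizeMb", "64"));
    }

    public static void main(String[] args) {
        DiskCacheSketch cache = new DiskCacheSketch();
        cache.init(Map.of("cacheSizeMb", "128"));
        System.out.println(cache.baseLocation + " " + cache.sizeMb);
    }
}
```

The design choice mirrors how DIH Transformers already work: unused attributes are simply ignored, so new cache implementations need no changes to the calling code.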