[jira] [Updated] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )
[ https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-13749: - Description: This ticket includes two query parsers. The first is the cross-collection join filter (XCJF) parser, which calls out to a remote collection to retrieve a set of join keys to be used as a filter against the local collection. The second is the hash range query parser, which takes a field name and a hash range and matches only the documents that would have hashed into that range. The XCJF parser performs an intersection based on join keys between two collections. The local collection is the collection that you are searching against; the remote collection is the collection that contains the join keys that you want to use as a filter. Each shard participating in the distributed request executes a query against the remote collection. If the local collection is set up with the compositeId router routed on the join key field, a hash range query is applied to the remote collection query so that it only matches documents that are a potential match for the documents in the local shard/core. Here's some vocab to help with the descriptions of the various parameters. 
||Term||Description||
|Local Collection|The main collection that is being queried.|
|Remote Collection|The collection that the XCJFQuery will query to resolve the join keys.|
|XCJFQuery|The Lucene query that executes a search to get back a set of join keys from a remote collection.|
|HashRangeQuery|The Lucene query that matches only the documents whose hash code on a field falls within a specified range.|

||Param||Default||Required||Description||
|collection| |Required|The name of the external Solr collection to be queried to retrieve the set of join key values.|
|zkHost| |Optional|The connection string used to connect to ZooKeeper. zkHost and solrUrl are both optional, and at most one of them should be specified. If neither is specified, the local ZooKeeper cluster is used.|
|solrUrl| |Optional|The URL of the external Solr node to be queried.|
|from| |Required|The join key field name in the external collection.|
|to| |Required|The join key field name in the local collection.|
|v| |See Note|The query to be executed against the external Solr collection to retrieve the set of join key values. Note: the query can be passed at the end of the string or as the "v" parameter. Query parameter substitution with the "v" parameter is recommended, to ensure no issues arise with the default query parsers.|
|routed|See Notes| |true / false. If true, the XCJF query uses each shard's hash range to determine the set of join keys to retrieve for that shard. This improves the performance of the cross-collection join, but it depends on the local collection being routed by the toField. If this parameter is not specified, the XCJF query will try to determine the correct value automatically.|
|ttl|3600| |The length of time, in seconds, that an XCJF query in the cache is considered valid. Defaults to 3600 (one hour). The XCJF query is not aware of changes to the remote collection, so if the remote collection is updated, cached XCJF queries may give inaccurate results. After the ttl period has expired, the XCJF query will re-execute the join against the remote collection.|
|_All others_| | |Any normal Solr parameter can also be specified as a local param.|

Example solrconfig.xml changes:
{code:xml}
<cache name="hash_vin"
       class="solr.LRUCache"
       size="128"
       initialSize="0"
       regenerator="solr.NoOpRegenerator"/>

<queryParser name="xcjf" class="org.apache.solr.search.join.XCJFQueryParserPlugin">
  <str name="routerField">vin</str>
</queryParser>

<queryParser name="hash_range" class="org.apache.solr.search.join.HashRangeQueryParserPlugin"/>
{code}

Example usage:
{code}
{!xcjf collection="otherCollection" from="fromField" to="toField" v="*:*"}
{code}
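The join-filter and hash-range mechanics described above can be sketched in a few lines. This is an illustrative in-memory model only, not Solr's implementation: CRC32 stands in for the MurmurHash3 used by the compositeId router, and the two collections are plain Python lists rather than Solr indexes.

```python
# Illustrative sketch of the XCJF flow; CRC32 is a stand-in hash function.
import zlib

def field_hash(value: str) -> int:
    """Hash a join-key value the way a hash_range query would (stand-in hash)."""
    return zlib.crc32(value.encode("utf-8"))

def hash_range_filter(docs, field, lower, upper):
    """hash_range parser idea: keep only docs whose field hash is in [lower, upper]."""
    return [d for d in docs if lower <= field_hash(d[field]) <= upper]

def xcjf_filter(local_docs, remote_docs, from_field, to_field, remote_query,
                shard_range=None):
    """Cross-collection join filter: query the remote collection for join keys,
    then keep local docs whose to_field value is in that key set."""
    matched = [d for d in remote_docs if remote_query(d)]
    # With routed=true, each shard asks only for join keys in its own hash range.
    if shard_range is not None:
        matched = hash_range_filter(matched, from_field, *shard_range)
    join_keys = {d[from_field] for d in matched}
    return [d for d in local_docs if d[to_field] in join_keys]

local = [{"id": "1", "vin": "A"}, {"id": "2", "vin": "B"}, {"id": "3", "vin": "C"}]
remote = [{"vin": "A", "recalled": True}, {"vin": "C", "recalled": False}]

hits = xcjf_filter(local, remote, "vin", "vin", lambda d: d["recalled"])
print([d["id"] for d in hits])  # → ['1']: only vin "A" has a recalled remote record
```

With `shard_range` covering a shard's slice of the hash space, each shard's remote query returns only the join keys that could possibly match its own documents, which is the performance win the `routed` parameter describes.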
[jira] [Created] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )
Kevin Watters created SOLR-13749: Summary: Implement support for joining across collections with multiple shards ( XCJF ) Key: SOLR-13749 URL: https://issues.apache.org/jira/browse/SOLR-13749 Project: Solr Issue Type: New Feature Security Level: Public (Default Security Level. Issues are Public) Reporter: Kevin Watters -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11384) add support for distributed graph query
[ https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916717#comment-16916717 ] Kevin Watters commented on SOLR-11384: -- [~erickerickson] Streaming expressions are fundamentally different in their semantics from the graph query. If there is renewed interest in this functionality, we can revisit it. At the moment, we're in the process of building a new cross-collection join operator (XCJF, cross-collection join filter). The work there is a stepping stone for a fully distributed graph traversal. [~komal_vmware] if you have a use case, let's chat about it. I do have a version of the distributed graph query working locally, but I don't consider it prime time due to a few pesky items related to caching. > add support for distributed graph query > --- > > Key: SOLR-11384 > URL: https://issues.apache.org/jira/browse/SOLR-11384 > Project: Solr > Issue Type: Improvement >Reporter: Kevin Watters >Priority: Minor > > Creating this ticket to track the work that I've done on the distributed > graph traversal support in Solr. > The current GraphQuery will only work on a single core, which introduces some > limits on where it can be used and also complexities if you want to scale it. > I believe there's a strong desire to support a fully distributed method of > doing the graph query. I'm working on a patch; it's not complete yet, but if > anyone would like to have a look at the approach and implementation, I > welcome much feedback. > The flow for the distributed graph query is almost exactly the same as the > normal graph query. The only difference is how it discovers the "frontier > query" at each level of the traversal. > When a distributed graph query request comes in, each shard begins by running > the root query, to know where to start on its shard. Each participating > shard then discovers its edges for the next hop. Those edges are then > broadcast to all other participating shards. 
Each shard then receives all the > parts of the frontier query, assembles it, and executes it. > This process continues on each shard until there are no new edges left, or > the maxDepth of the traversal has been reached. > The approach is to introduce a FrontierBroker that resides as a singleton on > each one of the Solr nodes in the cluster. When a graph query is created, it > can do a getInstance() on it so it can listen for the frontier parts coming in. > Initially, I was using an external Kafka broker to handle this, and it did > work pretty well. The new approach is migrating the FrontierBroker to be a > request handler in Solr, and potentially to use the SolrJ client to publish > the edges to each node in the cluster. > There are a few outstanding design questions. First: how do we know > what the list of shards is that are participating in the current query > request? Is that easy info to get at? > Second: currently we are serializing a query object between the shards; > perhaps we should consider a slightly different abstraction and serialize > lists of "edge" objects between the nodes. The point of this would be to > batch the exploration/traversal of the current frontier to help avoid large > bursts of memory being required. > Third: what sort of caching strategy should be introduced for the frontier > queries, if any? And if we do some caching there, how/when should the > entries be expired and auto-warmed? -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
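The frontier flow described in that ticket can be modeled as a toy in a few lines; this is purely illustrative (in-memory shards, with the broadcast step collapsed into a local set union, nothing like the actual FrontierBroker):

```python
# Toy model of the distributed frontier-query traversal: each round, every
# shard discovers outgoing edges for the current frontier, the edge sets are
# "broadcast" (merged), and the merged set becomes the next frontier query.

def traverse(shards, root_ids, max_depth):
    """Return all node ids reachable from root_ids within max_depth hops."""
    visited = set(root_ids)   # results of the root query
    frontier = set(root_ids)
    for _ in range(max_depth):
        # each shard discovers its edges for the current frontier
        discovered = set()
        for shard in shards:
            for doc in shard:
                if doc["from"] in frontier:
                    discovered.add(doc["to"])
        # assemble the next frontier query; stop when no new edges are left
        frontier = discovered - visited
        if not frontier:
            break
        visited |= frontier
    return visited

# edges are spread across shards, as they would be in a distributed collection
shard1 = [{"from": "a", "to": "b"}, {"from": "b", "to": "c"}]
shard2 = [{"from": "c", "to": "d"}, {"from": "x", "to": "y"}]

print(sorted(traverse([shard1, shard2], {"a"}, max_depth=10)))  # → ['a', 'b', 'c', 'd']
```

The per-round set difference against `visited` is what keeps the traversal from re-exploring nodes, and the merged `frontier` set plays the role of the serialized frontier query (or the proposed batched "edge" lists) exchanged between nodes.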
[jira] [Commented] (SOLR-12328) Adding graph json facet domain change
[ https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475938#comment-16475938 ] Kevin Watters commented on SOLR-12328: -- Hey Dan, this looks pretty awesome. One comment: if the traversal filter is null/empty, I don't think the default match-all query is needed. So, in the GraphField class, I think you can probably get rid of that null check and default value for the traversal filter. > Adding graph json facet domain change > - > > Key: SOLR-12328 > URL: https://issues.apache.org/jira/browse/SOLR-12328 > Project: Solr > Issue Type: Improvement > Security Level: Public (Default Security Level. Issues are Public) > Components: Facet Module >Affects Versions: 7.3 >Reporter: Daniel Meehl >Priority: Major > Attachments: SOLR-12328.patch > > > Json facets now support join queries via domain change. I've made a > relatively small enhancement to add graph to the mix. I'll attach a patch for > your viewing. I'm hoping this can be merged into Solr proper. Please let me > know if there are any problems/changes/requirements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Deleted] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models
[ https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-11838: - Comment: was deleted (was: One small item that I'm coming across here: it would seem that Solr is currently using Guava 14.0. DL4j depends on Guava 20.0. This dependency will break solrj if we integrate DL4j into Solr, due to deprecated methods in Guava 14. Thoughts? Maybe we should update Solr to a newer version of Guava? (I'm going through the same integration with MyRobotLab now, except I'm using an EmbeddedSolrServer at the moment.)) > explore supporting Deeplearning4j NeuralNetwork models > -- > > Key: SOLR-11838 > URL: https://issues.apache.org/jira/browse/SOLR-11838 > Project: Solr > Issue Type: New Feature >Reporter: Christine Poerschke >Priority: Major > Attachments: SOLR-11838.patch, SOLR-11838.patch > > > [~yuyano] wrote in SOLR-11597: > bq. ... If we think to apply this to more complex neural networks in the > future, we will need to support layers ... > [~malcorn_redhat] wrote in SOLR-11597: > bq. ... In my opinion, if this is a route Solr eventually wants to go, I > think a better strategy would be to just add a dependency on > [Deeplearning4j|https://deeplearning4j.org/] ... > Creating this ticket for the idea to be explored further (if anyone is > interested in exploring it), complementary to and independent of the > SOLR-11597 RankNet related effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models
[ https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368294#comment-16368294 ] Kevin Watters commented on SOLR-11838: -- One small item that I'm coming across here: it would seem that Solr is currently using Guava 14.0. DL4j depends on Guava 20.0. This dependency will break solrj if we integrate DL4j into Solr, due to deprecated methods in Guava 14. Thoughts? Maybe we should update Solr to a newer version of Guava? (I'm going through the same integration with MyRobotLab now, except I'm using an EmbeddedSolrServer at the moment.) > explore supporting Deeplearning4j NeuralNetwork models > -- > > Key: SOLR-11838 > URL: https://issues.apache.org/jira/browse/SOLR-11838 > Project: Solr > Issue Type: New Feature >Reporter: Christine Poerschke >Priority: Major > Attachments: SOLR-11838.patch, SOLR-11838.patch > > > [~yuyano] wrote in SOLR-11597: > bq. ... If we think to apply this to more complex neural networks in the > future, we will need to support layers ... > [~malcorn_redhat] wrote in SOLR-11597: > bq. ... In my opinion, if this is a route Solr eventually wants to go, I > think a better strategy would be to just add a dependency on > [Deeplearning4j|https://deeplearning4j.org/] ... > Creating this ticket for the idea to be explored further (if anyone is > interested in exploring it), complementary to and independent of the > SOLR-11597 RankNet related effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models
[ https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343602#comment-16343602 ] Kevin Watters commented on SOLR-11838: -- I'm very excited to see this integration happening. [~gus_heck] has been working with me on some DL4j projects, in particular training models and evaluating them for classification. I think at a high level there are 3 main integration patterns that we could / should consider in Solr. # using a model at ingest time to tag / annotate a record going into the index. (primary example would be something like sentiment analysis tagging.) This implies the model was trained and saved somewhere. # using a Solr index (query) to generate a set of training/test data so that DL4j can "fit" the model and train it. (there might even be a desire for some join functionality so you can join together 2 datasets to create ad hoc training datasets.) # (this is a bit more out there.) indexing each node of the multi layer network / computation graph as a document in the index, and using a query to evaluate the output of the model by traversing the documents in the index to ultimately come up with a set of relevancy scores for the documents that represent the output layer of the network. I think, perhaps, the most interesting use case is #2. So basically, the idea is you want to define a network (specify the layers, types of layers, activation function, etc.) and then specify a query that matches a set of documents in the index that would be used to train that model. Currently DL4j uses "DataVec" to handle all the data normalization prior to going into the model for training. That exposes a DataSetIterator. The DataSetIterator could be replaced with an iterator that sits on top of the export handler or even just a raw search result. 
The general use cases here for pagination would be # to iterate the full result set (presumably multiple times, as the model will make multiple passes over the data when training.) # to generate a random ordering of the dataset being returned # to exclude a random (but deterministic?) set of documents from the main query to provide a holdout testing dataset. Keeping in mind that typically in network training, you have both your training dataset and the testing dataset. The final outcome of this would be a ComputationGraph/MultiLayerNetwork, which can be serialized by DL4j as a JSON file, and the other output could/should be the evaluation or accuracy scores of the model (F1, Accuracy, and confusion matrix.) As per the comments about natives, yes, there are definitely platform-dependent parts of DL4j, in particular "nd4j", which can be GPU- or CPU-backed, but there are also other dependencies on javacv/javacpp. The javacv/javacpp stuff is really only used for image manipulation, as it's the Java binding to OpenCV. The dependency tree for DL4j is rather large, so I think we'll need to take care/caution that we're not injecting a bunch of conflicting jar files. Perhaps that risk can be managed if we identify the conflicting jar versions up front. > explore supporting Deeplearning4j NeuralNetwork models > -- > > Key: SOLR-11838 > URL: https://issues.apache.org/jira/browse/SOLR-11838 > Project: Solr > Issue Type: New Feature >Reporter: Christine Poerschke >Priority: Major > Attachments: SOLR-11838.patch > > > [~yuyano] wrote in SOLR-11597: > bq. ... If we think to apply this to more complex neural networks in the > future, we will need to support layers ... > [~malcorn_redhat] wrote in SOLR-11597: > bq. ... In my opinion, if this is a route Solr eventually wants to go, I > think a better strategy would be to just add a dependency on > [Deeplearning4j|https://deeplearning4j.org/] ... 
> Creating this ticket for the idea to be explored further (if anyone is > interested in exploring it), complementary to and independent of the > SOLR-11597 RankNet related effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
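The third pagination use case discussed in the comment above, excluding a random-but-deterministic subset of documents as a holdout/testing dataset, can be sketched independently of Solr and DL4j. This is an illustrative sketch only, not project code: the `is_holdout` helper and the `doc-N` ids are hypothetical, and the idea is simply to bucket each document by a stable hash of its id so the same documents land in the holdout set on every pass over the result set.

```python
import hashlib

def is_holdout(doc_id: str, holdout_fraction: float = 0.2) -> bool:
    # Hash the document id into a stable bucket in [0, 1); ids whose
    # bucket falls below holdout_fraction go to the holdout set.  The
    # assignment is deterministic, so repeated passes over the result
    # set (as in multi-epoch training) see the same split.
    bucket = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 1000
    return bucket / 1000.0 < holdout_fraction

doc_ids = [f"doc-{i}" for i in range(1000)]
train = [d for d in doc_ids if not is_holdout(d)]
holdout = [d for d in doc_ids if is_holdout(d)]
```

An iterator over the export handler could apply the same predicate per document, keeping the training and testing sets disjoint without materializing either one up front.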
[jira] [Updated] (SOLR-11384) add support for distributed graph query
[ https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-11384: - Issue Type: Improvement (was: Bug) > add support for distributed graph query > --- > > Key: SOLR-11384 > URL: https://issues.apache.org/jira/browse/SOLR-11384 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Kevin Watters >Priority: Minor > > Creating this ticket to track the work that I've done on the distributed > graph traversal support in Solr. > The current GraphQuery will only work on a single core, which introduces some > limits on where it can be used and also complexities if you want to scale it. > I believe there's a strong desire to support a fully distributed method of > doing the Graph Query. I'm working on a patch, it's not complete yet, but if > anyone would like to have a look at the approach and implementation, I > welcome much feedback. > The flow for the distributed graph query is almost exactly the same as the > normal graph query. The only difference is how it discovers the "frontier > query" at each level of the traversal. > When a distributed graph query request comes in, each shard begins by running > the root query, to know where to start on its shard. Each participating > shard then discovers its edges for the next hop. Those edges are then > broadcast to all other participating shards. The shard then receives all the > parts of the frontier query, assembles it, and executes it. > This process continues on each shard until there are no new edges left, or > the maxDepth of the traversal has finished. > The approach is to introduce a FrontierBroker that resides as a singleton on > each one of the Solr nodes in the cluster. When a graph query is created, it > can do a getInstance() on it so it can listen on the frontier parts coming in. 
> Initially, I was using an external Kafka broker to handle this, and it did > work pretty well. The new approach is migrating the FrontierBroker to be a > request handler in Solr, and potentially to use the SolrJ client to publish > the edges to each node in the cluster. > There are a few outstanding design questions, the first being: how do we know > what the list of shards is that are participating in the current query > request? Is that easy info to get at? > Second, currently we are serializing a query object between the shards; > perhaps we should consider a slightly different abstraction and serialize > lists of "edge" objects between the nodes. The point of this would be to > batch the exploration/traversal of the current frontier to help avoid large > bursts of memory being required. > Third, what sort of caching strategy should be introduced for the frontier > queries, if any? And if we do some caching there, how/when should the > entries be expired and auto-warmed? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11384) add support for distributed graph query
[ https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-11384: - Priority: Minor (was: Major) > add support for distributed graph query > --- > > Key: SOLR-11384 > URL: https://issues.apache.org/jira/browse/SOLR-11384 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Kevin Watters >Priority: Minor > > Creating this ticket to track the work that I've done on the distributed > graph traversal support in Solr. > The current GraphQuery will only work on a single core, which introduces some > limits on where it can be used and also complexities if you want to scale it. > I believe there's a strong desire to support a fully distributed method of > doing the Graph Query. I'm working on a patch, it's not complete yet, but if > anyone would like to have a look at the approach and implementation, I > welcome much feedback. > The flow for the distributed graph query is almost exactly the same as the > normal graph query. The only difference is how it discovers the "frontier > query" at each level of the traversal. > When a distributed graph query request comes in, each shard begins by running > the root query, to know where to start on its shard. Each participating > shard then discovers its edges for the next hop. Those edges are then > broadcast to all other participating shards. The shard then receives all the > parts of the frontier query, assembles it, and executes it. > This process continues on each shard until there are no new edges left, or > the maxDepth of the traversal has finished. > The approach is to introduce a FrontierBroker that resides as a singleton on > each one of the Solr nodes in the cluster. When a graph query is created, it > can do a getInstance() on it so it can listen on the frontier parts coming in. > Initially, I was using an external Kafka broker to handle this, and it did > work pretty well. 
The new approach is migrating the FrontierBroker to be a > request handler in Solr, and potentially to use the SolrJ client to publish > the edges to each node in the cluster. > There are a few outstanding design questions, the first being: how do we know > what the list of shards is that are participating in the current query > request? Is that easy info to get at? > Second, currently we are serializing a query object between the shards; > perhaps we should consider a slightly different abstraction and serialize > lists of "edge" objects between the nodes. The point of this would be to > batch the exploration/traversal of the current frontier to help avoid large > bursts of memory being required. > Third, what sort of caching strategy should be introduced for the frontier > queries, if any? And if we do some caching there, how/when should the > entries be expired and auto-warmed? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-11384) add support for distributed graph query
Kevin Watters created SOLR-11384: Summary: add support for distributed graph query Key: SOLR-11384 URL: https://issues.apache.org/jira/browse/SOLR-11384 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Kevin Watters Creating this ticket to track the work that I've done on the distributed graph traversal support in Solr. The current GraphQuery will only work on a single core, which introduces some limits on where it can be used and also complexities if you want to scale it. I believe there's a strong desire to support a fully distributed method of doing the Graph Query. I'm working on a patch, it's not complete yet, but if anyone would like to have a look at the approach and implementation, I welcome much feedback. The flow for the distributed graph query is almost exactly the same as the normal graph query. The only difference is how it discovers the "frontier query" at each level of the traversal. When a distributed graph query request comes in, each shard begins by running the root query, to know where to start on its shard. Each participating shard then discovers its edges for the next hop. Those edges are then broadcast to all other participating shards. The shard then receives all the parts of the frontier query, assembles it, and executes it. This process continues on each shard until there are no new edges left, or the maxDepth of the traversal has finished. The approach is to introduce a FrontierBroker that resides as a singleton on each one of the Solr nodes in the cluster. When a graph query is created, it can do a getInstance() on it so it can listen on the frontier parts coming in. Initially, I was using an external Kafka broker to handle this, and it did work pretty well. The new approach is migrating the FrontierBroker to be a request handler in Solr, and potentially to use the SolrJ client to publish the edges to each node in the cluster. 
There are a few outstanding design questions, the first being: how do we know what the list of shards is that are participating in the current query request? Is that easy info to get at? Second, currently we are serializing a query object between the shards; perhaps we should consider a slightly different abstraction and serialize lists of "edge" objects between the nodes. The point of this would be to batch the exploration/traversal of the current frontier to help avoid large bursts of memory being required. Third, what sort of caching strategy should be introduced for the frontier queries, if any? And if we do some caching there, how/when should the entries be expired and auto-warmed? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
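The traversal flow described in this ticket (each shard expands the shared frontier locally, the newly discovered edges are broadcast to all participating shards, and the process repeats until no new edges appear or maxDepth is reached) can be simulated in a few lines. This is an illustrative sketch of the algorithm only, not the FrontierBroker implementation: the shard-as-dict representation and all names are hypothetical.

```python
def distributed_graph_traverse(shards, root_nodes, max_depth=None):
    """Simulate the frontier exchange: each 'shard' is a dict of
    node -> outgoing edge ids for the documents it holds."""
    visited = set(root_nodes)
    frontier = set(root_nodes)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        # Each shard discovers edges for the current frontier locally.
        discovered = set()
        for shard in shards:
            for node in frontier:
                discovered.update(shard.get(node, ()))
        # "Broadcast": the union of all shards' edges becomes the next
        # frontier; already-visited nodes are skipped (cycle detection).
        frontier = discovered - visited
        visited.update(frontier)
        depth += 1
    return visited

# Two "shards", each holding part of the graph.
shards = [
    {"a": ["b"], "c": ["d"]},
    {"b": ["c"], "d": ["a"]},  # cycle back to "a"
]
print(sorted(distributed_graph_traverse(shards, ["a"])))  # → ['a', 'b', 'c', 'd']
```

Passing a max_depth stops the loop after that many hops, mirroring the maxDepth parameter of the graph query.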
[jira] [Comment Edited] (SOLR-9415) graph search filter edge
[ https://issues.apache.org/jira/browse/SOLR-9415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428480#comment-15428480 ] Kevin Watters edited comment on SOLR-9415 at 8/19/16 5:20 PM: -- Hello cmd, Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with the query "relationship:College" ... should be something like: {!graph from="name1" to="name2" traversalFilter="+relationship:College_school_classmate +time:[2015-01-01 TO 2016-01-01]"}name1:tom -Kevin was (Author: kwatters): Hello cmd, Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with the query "relationship:College" ... should be something like: {!graph from="name1" to="name2" traversalFilter="relationship:College_school_classmate"}name1:tom -Kevin > graph search filter edge > > > Key: SOLR-9415 > URL: https://issues.apache.org/jira/browse/SOLR-9415 > Project: Solr > Issue Type: Wish > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.1 >Reporter: cmd > Fix For: 6.x > > > currently solr graph hasn't edge concept! for example: > name1(node),name2(node),relationtype,time,other edge attr. > tom,alice,College_school_classmate,2016-10-01 > tom,alice,High_school_classmate,2013-10-01 > tom,alice,middle_school_classmate,2009-10-01 > tom,alice,Primary_school_classmate,2005-10-01 > tom,Smith,College_school_classmate,2016-10-01 > tom,Smith,High_school_classmate,2013-10-01 > tom,Smith,middle_school_classmate,2009-10-01 > tom,Smith,Primary_school_classmate,2005-10-01 > node > tom age:23 sex:male addr: > Smith age:25 sex... > alice . > i want to filter: tom time:[2009 to 2013] and addr: and > relationtype=College is who? > refer: http://graphml.graphdrawing.org/primer/graphml-primer.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9415) graph search filter edge
[ https://issues.apache.org/jira/browse/SOLR-9415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428480#comment-15428480 ] Kevin Watters commented on SOLR-9415: - Hello cmd, Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with the query "relationship:College" ... should be something like: {!graph from="name1" to="name2" traversalFilter="relationship:College_school_classmate"}name1:tom -Kevin > graph search filter edge > > > Key: SOLR-9415 > URL: https://issues.apache.org/jira/browse/SOLR-9415 > Project: Solr > Issue Type: Wish > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.1 >Reporter: cmd > Fix For: 6.x > > > currently solr graph hasn't edge concept! for example: > name1(node),name2(node),relationtype,time,other edge attr. > tom,alice,College_school_classmate,2016-10-01 > tom,alice,High_school_classmate,2013-10-01 > tom,alice,middle_school_classmate,2009-10-01 > tom,alice,Primary_school_classmate,2005-10-01 > tom,Smith,College_school_classmate,2016-10-01 > tom,Smith,High_school_classmate,2013-10-01 > tom,Smith,middle_school_classmate,2009-10-01 > tom,Smith,Primary_school_classmate,2005-10-01 > node > tom age:23 sex:male addr: > Smith age:25 sex... > alice . > i want to filter: tom time:[2009 to 2013] and addr: and > relationtype=College is who? > refer: http://graphml.graphdrawing.org/primer/graphml-primer.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes
[ https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261408#comment-15261408 ] Kevin Watters commented on SOLR-9027: - No specific use case, but if doc frequency is 0 for a given term in a "node/from" field, there's not much point in traversing it, or querying for it in the first place. I'm not sure if that's even possible, but you might have edges that point to a document that doesn't exist; in such a case, it's an easy optimization to avoid that traversal. (similar to the leafNodesOnly parameter on the GraphQuery.) > Add GraphTermsQuery to limit traversal on high frequency nodes > -- > > Key: SOLR-9027 > URL: https://issues.apache.org/jira/browse/SOLR-9027 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Priority: Minor > Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch, > SOLR-9027.patch > > > The gatherNodes() Streaming Expression is currently using a basic disjunction > query to perform the traversals. This ticket is to create a specific > GraphTermsQuery for performing the traversals. > The GraphTermsQuery will be based off of the TermsQuery, but will also > include an option for a docFreq cutoff. Terms that are above the docFreq > cutoff will not be included in the query. This will help users do a more > precise and efficient traversal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes
[ https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261311#comment-15261311 ] Kevin Watters commented on SOLR-9027: - Yes, sorry for the typo, minDocFreq :) Avoiding sparse edges could be a useful use case (especially in a distributed setting). > Add GraphTermsQuery to limit traversal on high frequency nodes > -- > > Key: SOLR-9027 > URL: https://issues.apache.org/jira/browse/SOLR-9027 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Priority: Minor > Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch, > SOLR-9027.patch > > > The gatherNodes() Streaming Expression is currently using a basic disjunction > query to perform the traversals. This ticket is to create a specific > GraphTermsQuery for performing the traversals. > The GraphTermsQuery will be based off of the TermsQuery, but will also > include an option for a docFreq cutoff. Terms that are above the docFreq > cutoff will not be included in the query. This will help users do a more > precise and efficient traversal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes
[ https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261007#comment-15261007 ] Kevin Watters commented on SOLR-9027: - Nice stuff Joel! Just a thought: it might be nice to also provide a "maxDocFreq" on the GraphTermsQuery... relatively easy to add now, and it would allow graph traversals that ignore sparse edges... Either way, this is very cool. It seems like this would be a natural enhancement for the GraphQuery when it builds the frontier. > Add GraphTermsQuery to limit traversal on high frequency nodes > -- > > Key: SOLR-9027 > URL: https://issues.apache.org/jira/browse/SOLR-9027 > Project: Solr > Issue Type: New Feature >Reporter: Joel Bernstein >Priority: Minor > Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch, > SOLR-9027.patch > > > The gatherNodes() Streaming Expression is currently using a basic disjunction > query to perform the traversals. This ticket is to create a specific > GraphTermsQuery for performing the traversals. > The GraphTermsQuery will be based off of the TermsQuery, but will also > include an option for a docFreq cutoff. Terms that are above the docFreq > cutoff will not be included in the query. This will help users do a more > precise and efficient traversal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
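The docFreq-cutoff idea discussed in these comments (a minDocFreq floor for sparse edges plus the proposed maxDocFreq ceiling for high-frequency nodes) reduces to pruning the terms list by document frequency before the traversal query is built. A hedged sketch with hypothetical names; the real GraphTermsQuery operates on Lucene term statistics, not a plain dict:

```python
def prune_terms(term_doc_freqs, min_doc_freq=1, max_doc_freq=None):
    # Drop terms below min_doc_freq (sparse edges, e.g. pointing at
    # documents that don't exist) and above max_doc_freq (high-frequency
    # "hub" nodes whose expansion would dominate the traversal).
    kept = []
    for term, df in term_doc_freqs.items():
        if df < min_doc_freq:
            continue
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        kept.append(term)
    return kept

print(prune_terms({"hub": 100000, "orphan": 0, "normal": 12},
                  min_doc_freq=1, max_doc_freq=1000))  # → ['normal']
```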
[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212525#comment-15212525 ] Kevin Watters edited comment on SOLR-8176 at 3/29/16 2:19 PM: -- Here's a patch with a basic implementation of a Kafka based frontier query broker to support distributed graph query traversal in Solr. The unit test is commented out because it requires a Kafka broker to be running. Also, there's some config parameters / properties that are hard coded. Either way, this shows how to use the GraphQuery in a distributed graph traversal mode. Disclaimer: this patch isn't intended to be merged, it's really only an example of how to do it. there's a lot of cleanup that still needs to happen to make it ready for primetime. was (Author: kwatters): Here's a patch with a basic implementation of a Kafka based frontier query broker to support distributed graph query traversal in Solr. The unit test is commented out because it requires a Kafka broker to be running. Also, there's some config parameters / properties that are hard coded. Either way, this shows how to use the GraphQuery in a distributed graph traversal mode. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > Attachments: SOLR-8176.patch > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-8176: Attachment: SOLR-8176.patch Here's a patch with a basic implementation of a Kafka-based frontier query broker to support distributed graph query traversal in Solr. The unit test is commented out because it requires a Kafka broker to be running. Also, there are some config parameters / properties that are hard-coded. Either way, this shows how to use the GraphQuery in a distributed graph traversal mode. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > Attachments: SOLR-8176.patch > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197814#comment-15197814 ] Kevin Watters commented on SOLR-8176: - Hi Gopal, I'm running a little bit behind the times, I'm still working off a branch that was checked out from SVN. I'll update to trunk from git and make sure my local tests are still passing and I'll post a patch after I can clean up my comments and code a little bit. Joel, Thanks for the pointer, I'll have a look at the TopicStream... It might do what we need. If not, maybe we can extend it. I've been focusing on Kafka because it's pretty simple, generic, robust and scales really well. I'm not tied to any particular technology for it, so long as we can publish some objects with a unique topic identifier. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192120#comment-15192120 ] Kevin Watters commented on SOLR-8176: - Hey Guys, I know you're really focusing on streaming expressions for graph traversal, I just wanted to throw it out there. I have a version of it working based on the GraphQuery. It's completely distributed, the only kicker is, I implemented it with a dependency on Kafka as a message broker to handle the shuffling of the frontier query. I was curious if there's a message broker already in the Solr stack, if so, it should be reasonably easy to swap out the kafka dependency and then we'll all have a fully distributed graph traversal in Solr. Let me know what you think, > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-8532) Optimize GraphQuery when maxDepth is set
[ https://issues.apache.org/jira/browse/SOLR-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-8532: Attachment: SOLR-8532.patch This is the patch for optimizations for the graph query. > Optimize GraphQuery when maxDepth is set > > > Key: SOLR-8532 > URL: https://issues.apache.org/jira/browse/SOLR-8532 > Project: Solr > Issue Type: Bug >Reporter: Kevin Watters > Attachments: SOLR-8532.patch > > > The current graph query implementation always collects edges. When a > maxDepth is specified, there is an obvious optimization to not collect edges > at the maxDepth level. > In addition there are some other memory optimizations that I'd like to merge > in. I have an updated version that includes the above optimization; in > addition, there are some memory optimizations that can be applied if > returnRoot != false. With that, it doesn't need to hold on to the original > docset that matched the root nodes of the query. > I will be posting the patch in the next few days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-8532) Optimize GraphQuery when maxDepth is set
Kevin Watters created SOLR-8532: --- Summary: Optimize GraphQuery when maxDepth is set Key: SOLR-8532 URL: https://issues.apache.org/jira/browse/SOLR-8532 Project: Solr Issue Type: Bug Reporter: Kevin Watters The current graph query implementation always collects edges. When a maxDepth is specified, there is an obvious optimization to not collect edges at the maxDepth level. In addition there are some other memory optimizations that I'd like to merge in. I have an updated version that includes the above optimization; in addition, there are some memory optimizations that can be applied if returnRoot != false. With that, it doesn't need to hold on to the original docset that matched the root nodes of the query. I will be posting the patch in the next few days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945822#comment-14945822 ] Kevin Watters commented on SOLR-7543: - Nice improvements! The new TermsQuery definitely is a nice fit for this type of query. (though that code path is only active if useAutn=false, so it doesn't do the automaton compilation.) Looks good to me, let's roll with it! > Create GraphQuery that allows graph traversal as a query operator. > -- > > Key: SOLR-7543 > URL: https://issues.apache.org/jira/browse/SOLR-7543 > Project: Solr > Issue Type: New Feature > Components: search >Reporter: Kevin Watters >Priority: Minor > Attachments: SOLR-7543.patch, SOLR-7543.patch > > > I have a GraphQuery that I implemented a long time back that allows a user to > specify a "startQuery" to identify which documents to start graph traversal > from. It then gathers up the edge ids for those documents, optionally > applies an additional filter. The query is then re-executed continually > until no new edge ids are identified. I am currently hosting this code up at > https://github.com/kwatters/solrgraph and I would like to work with the > community to get some feedback and ultimately get it committed back in as a > lucene query. > Here's a bit more of a description of the parameters for the query / graph > traversal: > q - the initial start query that identifies the universe of documents to > start traversal from. > fromField - the field name that contains the node id > toField - the name of the field that contains the edge id(s). > traversalFilter - this is an additional query that can be supplied to limit > the scope of graph traversal to just the edges that satisfy the > traversalFilter query. > maxDepth - integer specifying how deep the breadth first search should go. > returnStartNodes - boolean to determine if the documents that matched the > original "q" should be returned as part of the graph. 
> onlyLeafNodes - boolean that filters the graph query to only return > documents/nodes that have no edges. > We identify a set of documents with "q" as any arbitrary lucene query. It > will collect the values in the fromField, create an OR query with those > values , optionally apply an additional constraint from the "traversalFilter" > and walk the result set until no new edges are detected. Traversal can also > be stopped at N hops away as defined with the maxDepth. This is a BFS > (Breadth First Search) algorithm. Cycle detection is done by not revisiting > the same document for edge extraction. > This query operator does not keep track of how you arrived at the document, > but only that the traversal did arrive at the document. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
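The traversal that these parameters describe can be sketched as a tiny in-memory BFS. This is an illustrative sketch only: the document/field shapes and the `graph_traverse` helper are invented for the example and are not Solr's actual data structures or API.

```python
def graph_traverse(docs, start_ids, from_field, to_field,
                   max_depth=-1, traversal_filter=None):
    """Return the set of doc ids reachable from start_ids via BFS."""
    visited = set(start_ids)          # cycle detection: never revisit a doc
    frontier = set(start_ids)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # gather edge values from the current frontier (the "OR query" step)
        edges = set()
        for doc_id in frontier:
            edges.update(docs[doc_id].get(to_field, []))
        # find unvisited docs whose node id matches a collected edge,
        # optionally restricted by the traversal filter
        frontier = {
            d for d, fields in docs.items()
            if d not in visited
            and fields.get(from_field) in edges
            and (traversal_filter is None or traversal_filter(fields))
        }
        visited |= frontier
        depth += 1
    return visited

docs = {
    "doc_1": {"node": "a", "edge": ["b"]},
    "doc_2": {"node": "b", "edge": ["c"]},
    "doc_3": {"node": "c", "edge": []},
    "doc_4": {"node": "x", "edge": []},
}
print(sorted(graph_traverse(docs, {"doc_1"}, "node", "edge")))
# ['doc_1', 'doc_2', 'doc_3']
```

As in the description above, the only state carried across iterations is the visited set (the bitset in the real implementation); no path history is kept, which is why the operator can say a document was reached but not how.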
[jira] [Updated] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-7543: Attachment: SOLR-7543.patch Patch with GraphQuery / parsers / unit tests. > Create GraphQuery that allows graph traversal as a query operator. > Key: SOLR-7543 > Project: Solr > Issue Type: New Feature > Components: search > Reporter: Kevin Watters > Priority: Minor > Attachments: SOLR-7543.patch
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550544#comment-14550544 ] Kevin Watters commented on SOLR-7543: - [~ysee...@gmail.com], right now it builds against 4.x. If I submit a patch, should it be done for trunk, or is a 4.x branch OK? I'm just finishing up the unit tests; either way, I hope to have a patch submitted by the end of the week.
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549615#comment-14549615 ] Kevin Watters commented on SOLR-7543: - Hi [~steff1193], I agree we want to do all sorts of great types of graph queries. The problem is, as soon as you take the step of maintaining metadata about the graph traversal, the memory requirements for such an operation can be huge. The way I see it, there are likely 3 things to do to close the gap:
* Make the traversalFilter a more complex data structure (like an array), to allow different filters at different graph traversal levels.
* Accumulate a weight field on the traversed edges as part of the relevancy score (currently no ranking is done).
* Maintain the history of edges that traverse into a node.
All of these could be considered for future functionality, but it would really take some re-thinking of how it all works. For now, having the functionality to apply the graph as a filter to the result set is the goal. In many cases, if you nest these graph queries and the documents are structured properly, you should still be able to achieve the result that you desire, but we'd have to take that on a case-by-case basis.
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549652#comment-14549652 ] Kevin Watters commented on SOLR-7543: - Ok, my initial graph query parser is handling the following syntax: {!graph from=node_field to=edge_field returnRoot=true returnOnlyLeaf=false maxDepth=-1 traversalFilter=foo:bar}id:doc_8 The above would start traversal at doc_8 and only walk nodes that have a field foo containing the value bar. This seems to be more consistent with the rest of the query parsers.
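For illustration, a client could assemble that local-params filter string programmatically. This is a hedged sketch: the `graph_fq` helper is invented for this example, and it does no escaping or quoting of real query text.

```python
def graph_fq(start_query, **local_params):
    """Build a {!graph key=value ...}start_query local-params string (naive, no escaping)."""
    params = " ".join("%s=%s" % (k, v) for k, v in local_params.items())
    return "{!graph %s}%s" % (params, start_query)

fq = graph_fq("id:doc_8", **{"from": "node_field", "to": "edge_field",
                             "maxDepth": "-1", "traversalFilter": "foo:bar"})
print(fq)
# {!graph from=node_field to=edge_field maxDepth=-1 traversalFilter=foo:bar}id:doc_8
```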
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545546#comment-14545546 ] Kevin Watters commented on SOLR-7543: - Interesting, Dennis, I wasn't aware of SOLR-7377. I'll have to take a bit more time to understand what that means in the context of the graph query. I'm not sure how cross-collection graph traversal will play with my implementation; the issue is that my lucene graph query is currently local to a single shard/core. I have been chatting with [~joel.bernstein] about the distributed graph traversal use case, and I think there is a play for streaming aggregation there. There is one line that needs to be coordinated/synchronized across the cluster to do the distributed graph traversal; I think that's where the streaming stuff comes in. I like the idea of renaming returnStartNodes to returnRoot ... fewer words and hopefully more descriptive of what is happening (same for returnOnlyLeaf). Maybe the word nodes is redundant, and it obscures that it's really just a document at the end of the day.
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546031#comment-14546031 ] Kevin Watters commented on SOLR-7543: - [~steff1193] My mantra here relates to something I heard once: ??A graph is a filter on top of your data. -someone?? So, I'm offering this implementation up to solve that use case. Analytics on top of that graph would be achieved via faceting or streaming aggregation. Maybe there's something that Titan could leverage from this implementation? There are some starting plans on doing a distributed version of this query operator. [~dpgove] Interesting syntax. The use case of children 4 isn't currently supported in my impl. My impl doesn't have any history of the paths through the graph; it only has the bitset that represents the matched documents. I wanted to keep it as lean as possible. We could start keeping around additional data structures during the traversal to count, but that can get very expensive very quickly. My goal/desire here is to keep the memory usage to one bitset.
[jira] [Updated] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Watters updated SOLR-7543: Description: I have a GraphQuery that I implemented a long time back that allows a user to specify a startQuery to identify which documents to start graph traversal from. It then gathers up the edge ids for those documents, optionally applies an additional filter. The query is then re-executed continually until no new edge ids are identified. I am currently hosting this code up at https://github.com/kwatters/solrgraph and I would like to work with the community to get some feedback and ultimately get it committed back in as a lucene query. Here's a bit more of a description of the parameters for the query / graph traversal: q - the initial start query that identifies the universe of documents to start traversal from. fromField - the field name that contains the node id. toField - the name of the field that contains the edge id(s). traversalFilter - an additional query that can be supplied to limit the scope of graph traversal to just the edges that satisfy the traversalFilter query. maxDepth - integer specifying how deep the breadth first search should go. returnStartNodes - boolean to determine if the documents that matched the original q should be returned as part of the graph. onlyLeafNodes - boolean that filters the graph query to only return documents/nodes that have no edges. We identify a set of documents with q as any arbitrary lucene query. It will collect the values in the fromField, create an OR query with those values, optionally apply an additional constraint from the traversalFilter, and walk the result set until no new edges are detected. Traversal can also be stopped at N hops away as defined with the maxDepth. This is a BFS (Breadth First Search) algorithm. Cycle detection is done by not revisiting the same document for edge extraction. This query operator does not keep track of how you arrived at the document, but only that the traversal did arrive at the document. was: I have a GraphQuery that I implemented a long time back that allows a user to specify a seedQuery to identify which documents to start graph traversal from. It then gathers up the edge ids for those documents, optionally applies an additional filter. The query is then re-executed continually until no new edge ids are identified. I am currently hosting this code up at https://github.com/kwatters/solrgraph and I would like to work with the community to get some feedback and ultimately get it committed back in as a lucene query.
[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
[ https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543982#comment-14543982 ] Kevin Watters commented on SOLR-7543: - Hi Yonik, thanks for chiming in! Yup, you can think of this as a multi-step join. In fact, I use the graph operator with a maxDepth of 1 to implement an inner join. I like things to be consistent (it's easier for others to grok that way), so we can rename the fromField and the toField to be from and to. When it comes to the GraphQueryParser(Plugin), I'm open to whatever the community likes and whatever is consistent with the other parsers out there. (I've always been a bit thrown by the !parser_name syntax, which is why I also have a client-side object model: I programmatically build up an expression and serialize it over to my custom parser, which deserializes it and converts it into the appropriate lucene query objects.) I suppose I just want to make sure that the v=my_start_query can be any arbitrary lucene query. I also still need to work up some richer examples and test cases as part of this ticket.
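The "maxDepth of 1 behaves like an inner join" remark can be shown with a toy example (data and field names invented for the sketch): one frontier expansion keeps exactly the documents whose join key matches a key from the start set.

```python
# One BFS hop == inner join on the key field.
start_docs = {"p1": "k1", "p2": "k2"}   # doc id -> join key (the "q" matches)
other_docs = {"c1": "k1", "c2": "k3"}   # doc id -> join key
keys = set(start_docs.values())          # edge values collected from the start set
joined = sorted(d for d, k in other_docs.items() if k in keys)
print(joined)
# ['c1']
```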
[jira] [Created] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.
Kevin Watters created SOLR-7543: --- Summary: Create GraphQuery that allows graph traversal as a query operator. Key: SOLR-7543 URL: https://issues.apache.org/jira/browse/SOLR-7543 Project: Solr Issue Type: New Feature Components: search Reporter: Kevin Watters Priority: Minor I have a GraphQuery that I implemented a long time back that allows a user to specify a seedQuery to identify which documents to start graph traversal from. It then gathers up the edge ids for those documents, optionally applies an additional filter. The query is then re-executed continually until no new edge ids are identified. I am currently hosting this code up at https://github.com/kwatters/solrgraph and I would like to work with the community to get some feedback and ultimately get it committed back in as a lucene query.
[jira] [Commented] (SOLR-4787) Join Contrib
[ https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13652997#comment-13652997 ] Kevin Watters commented on SOLR-4787: - Hey Joel, It was good to meet you at the conference last week. We talked a little bit about my GraphQuery operator. The use case of a 1-level graph traversal can accomplish a post-filter join request. The caveat is that you won't know which record was joined to, only that it did satisfy the join requirement. I could contribute it here, or perhaps we could create a Graph Contrib ticket? Thanks, -Kevin

Join Contrib Key: SOLR-4787 URL: https://issues.apache.org/jira/browse/SOLR-4787 Project: Solr Issue Type: New Feature Components: search Affects Versions: 4.2.1 Reporter: Joel Bernstein Priority: Minor Fix For: 4.2.1 Attachments: SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch

This contrib provides a place where different join implementations can be contributed to Solr. This contrib currently includes 2 join implementations. The initial patch was generated from the Solr 4.2.1 tag. Because of changes in the FieldCache API this patch will only build with Solr 4.2 or above.

*PostFilterJoinQParserPlugin aka pjoin* The pjoin provides a join implementation that filters results in one core based on the results of a search in another core. This is similar in functionality to the JoinQParserPlugin, but the implementation differs in a couple of important ways. The first is that the pjoin is designed to work with integer join keys only; so, in order to use pjoin, integer join keys must be included in both the to and from core. The second difference is that the pjoin builds memory structures that are used to quickly connect the join keys. It also uses a custom SolrCache named "join" to hold intermediate DocSets which are needed to build the join memory structures. So, the pjoin will need more memory than the JoinQParserPlugin to perform the join. The main advantage of the pjoin is that it can scale to join millions of keys between cores. Because it's a PostFilter, it only needs to join records that match the main query. The syntax of the pjoin is the same as the JoinQParserPlugin except that the plugin is referenced by the string "pjoin" rather than "join":

fq={!pjoin fromCore=collection2 from=id_i to=id_i}user:customer1

The example filter query above will search the fromCore (collection2) for user:customer1. This query will generate a list of values from the "from" field that will be used to filter the main query. Only records from the main query, where the "to" field is present in the "from" list, will be included in the results. The solrconfig.xml in the main query core must contain the reference to the pjoin:

<queryParser name="pjoin" class="org.apache.solr.joins.PostFilterJoinQParserPlugin"/>

And the join contrib jars must be registered in the solrconfig.xml:

<lib dir="../../../dist/" regex="solr-joins-\d.*\.jar" />

The solrconfig.xml in the fromCore must have the "join" SolrCache configured:

<cache name="join" class="solr.LRUCache" size="4096" initialSize="1024" />

*JoinValueSourceParserPlugin aka vjoin* The second implementation is the JoinValueSourceParserPlugin aka vjoin. This implements a ValueSource function query that can return values from a second core based on join keys. This allows relevance data to be stored in a separate core and then joined in the main query. The vjoin is called using the vjoin function query. For example:

bf=vjoin(fromCore, fromKey, fromVal, toKey)

This example shows vjoin being called by the edismax boost function parameter. It will return the fromVal from the fromCore. The fromKey and toKey are used to link the records from the main query to the records in the fromCore. As with the pjoin, both the fromKey and toKey must be integers. Also like the pjoin, the "join" SolrCache is used to hold the join memory structures.
To configure the vjoin you must register the ValueSource plugin in the solrconfig.xml as follows:

<valueSourceParser name="vjoin" class="org.apache.solr.joins.JoinValueSourceParserPlugin" />
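As a rough illustration of the two join styles described above (data shapes invented for the sketch; this is not Solr's implementation): pjoin collects integer join keys matching the from-core query into a hash set and post-filters the main results, while vjoin builds a key-to-value map and looks up a value per main-query document.

```python
# pjoin-style post filter: gather join keys from the from-core query,
# then keep only main-query docs whose "to" key is in that set.
from_core = [{"id_i": 7, "user": "customer1"},
             {"id_i": 9, "user": "customer2"}]
main_results = [{"doc": "a", "id_i": 7},
                {"doc": "b", "id_i": 8},
                {"doc": "c", "id_i": 9}]
join_keys = {r["id_i"] for r in from_core if r["user"] == "customer1"}
filtered = [r["doc"] for r in main_results if r["id_i"] in join_keys]
print(filtered)          # ['a']

# vjoin-style value lookup: map fromKey -> fromVal, then fetch a boost
# value for each main-query doc (0.0 when the key has no match).
vals = {7: 0.9, 9: 0.2}  # fromKey -> fromVal
boosts = [vals.get(r["id_i"], 0.0) for r in main_results]
print(boosts)            # [0.9, 0.0, 0.2]
```

The hash-set/hash-map shapes mirror why both plugins want integer keys and extra memory: the join structures live entirely in RAM, keyed for O(1) lookup per matching document.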