[jira] [Updated] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2019-09-09 Thread Kevin Watters (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-13749:
-
Description: 
This ticket includes two query parsers.

The first is the cross-collection join filter (XCJF) query parser. It can call 
out to a remote collection to get a set of join keys, which are then used as a 
filter against the local collection.

The second is the hash range query parser: given a field name and a hash 
range, it matches only the documents whose field value hashes into that range.

The XCJF query parser performs an intersection between two collections based 
on join keys.

The local collection is the collection that you are searching against.

The remote collection is the collection that contains the join keys that you 
want to use as a filter.

Each shard participating in the distributed request will execute a query 
against the remote collection.  If the local collection is set up with the 
compositeId router, routed on the join key field, a hash range query is 
applied to the remote collection query so that it only matches documents that 
could join against the documents in the local shard/core.
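
For illustration, the per-shard filter added to the remote query might look 
like {!hash_range f=vin l=-2147483648 u=-1} (the parameter names here are 
assumptions for this sketch, not settled syntax), so each shard only pulls 
back join keys that could hash into its own range.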

 

Here's some vocabulary to help with the descriptions of the various parameters.
||Term||Description||
|Local Collection|The main collection that is being queried.|
|Remote Collection|The collection that the XCJFQuery will query to resolve the 
join keys.|
|XCJFQuery|The Lucene query that executes a search against a remote collection 
to get back a set of join keys.|
|HashRangeQuery|The Lucene query that matches only the documents whose hash 
code on a field falls within a specified range.|

 

 
||Param||Required||Description||
|collection|Required|The name of the external Solr collection to be queried to 
retrieve the set of join key values.|
|zkHost|Optional|The connection string to be used to connect to ZooKeeper.  
zkHost and solrUrl are both optional, and at most one of them should be 
specified.  If neither is specified, the local ZooKeeper cluster will be used.|
|solrUrl|Optional|The URL of the external Solr node to be queried.|
|from|Required|The join key field name in the external collection.|
|to|Required|The join key field name in the local collection.|
|v|See note|The query to be executed against the external Solr collection to 
retrieve the set of join key values.  
Note: the query can be passed at the end of the local-params string or as the 
"v" parameter.  
Parameter substitution with the "v" parameter is recommended to avoid issues 
with the default query parsers (see the example after this table).|
|routed|Optional|true / false.  If true, the XCJF query will use each shard's 
hash range to determine the set of join keys to retrieve for that shard.  
This parameter improves the performance of the cross-collection join, but 
it requires the local collection to be routed by the "to" field.  If this 
parameter is not specified, 
the XCJF query will try to determine the correct value automatically.|
|ttl|Optional|The length of time, in seconds, that an XCJF query in the cache 
will be considered valid.  Defaults to 3600 (one hour).  
The XCJF query will not be aware of changes to the remote collection, so 
if the remote collection is updated, cached XCJF queries may give inaccurate 
results.  
After the ttl period has expired, the XCJF query will re-execute the join 
against the remote collection.|
|_All others_|Optional|Any normal Solr parameter can also be specified as a 
local param.|
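
For example, to pass the remote query via parameter substitution as 
recommended for "v" above, a request might include the following (the 
collection, field, and query values are invented for the example):

fq={!xcjf collection=otherCollection from=fromField to=toField v=$remoteQuery}
remoteQuery=status:ACTIVE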

 

Example solrconfig.xml changes:

 
 <cache name="hash_vin"
        class="solr.LRUCache"
        size="128"
        initialSize="0"
        regenerator="solr.NoOpRegenerator"/>

 <queryParser name="xcjf"
              class="org.apache.solr.search.join.XCJFQueryParserPlugin">
   <str name="routerField">vin</str>
 </queryParser>

 <queryParser name="hash_range"
              class="org.apache.solr.search.join.HashRangeQueryParserPlugin"/>

Example Usage:

{!xcjf collection="otherCollection" from="fromField" to="toField" v="*:*"}
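
For a sense of how this might be invoked from a client, here is a minimal 
SolrJ sketch (the URL, collection, and field names are made-up placeholders; 
only the xcjf syntax itself comes from this ticket):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XcjfExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // Filter the local collection by join keys resolved from the remote
      // collection, passing the remote query via parameter substitution.
      query.addFilterQuery(
          "{!xcjf collection=otherCollection from=fromField to=toField v=$remoteQuery}");
      query.set("remoteQuery", "*:*");
      QueryResponse rsp = client.query("localCollection", query);
      System.out.println("matched: " + rsp.getResults().getNumFound());
    }
  }
}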
  
  

 

 

 

[jira] [Created] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2019-09-09 Thread Kevin Watters (Jira)
Kevin Watters created SOLR-13749:


 Summary: Implement support for joining across collections with 
multiple shards ( XCJF )
 Key: SOLR-13749
 URL: https://issues.apache.org/jira/browse/SOLR-13749
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Kevin Watters



[jira] [Commented] (SOLR-11384) add support for distributed graph query

2019-08-27 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916717#comment-16916717
 ] 

Kevin Watters commented on SOLR-11384:
--

[~erickerickson]  Streaming expressions are fundamentally different in their 
semantics from the graph query.  If there is renewed interest in this 
functionality, we can revisit it.

At the moment, we're in the process of building a new cross-collection join 
operator (XCJF, the cross-collection join filter).  The work there is a 
stepping stone for a fully distributed graph traversal.

[~komal_vmware] if you have a use case, let's chat about it.  I do have a 
version of the distributed graph query working locally, but I don't consider 
it ready for prime time due to a few pesky items related to caching.

> add support for distributed graph query
> ---
>
> Key: SOLR-11384
> URL: https://issues.apache.org/jira/browse/SOLR-11384
> Project: Solr
>  Issue Type: Improvement
>Reporter: Kevin Watters
>Priority: Minor
>
> Creating this ticket to track the work that I've done on the distributed 
> graph traversal support in solr.
> Current GraphQuery will only work on a single core, which introduces some 
> limits on where it can be used and also complexities if you want to scale it. 
>  I believe there's a strong desire to support a fully distributed method of 
> doing the Graph Query.  I'm working on a patch, it's not complete yet, but if 
> anyone would like to have a look at the approach and implementation,  I 
> welcome much feedback.  
> The flow for the distributed graph query is almost exactly the same as the 
> normal graph query.  The only difference is how it discovers the "frontier 
> query" at each level of the traversal.  
> When a distributed graph query request comes in, each shard begins by running 
> the root query, to know where to start on its shard.  Each participating 
> shard then discovers its edges for the next hop.  Those edges are then 
> broadcast to all other participating shards.  The shard then receives all the 
> parts of the frontier query, assembles it, and executes it.
> This process continues on each shard until there are no new edges left, or 
> the maxDepth of the traversal has finished.
> The approach is to introduce a FrontierBroker that resides as a singleton on 
> each one of the solr nodes in the cluster.  When a graph query is created, it 
> can do a getInstance() on it so it can listen on the frontier parts coming in.
> Initially, I was using an external Kafka broker to handle this, and it did 
> work pretty well.  The new approach is migrating the FrontierBroker to be a 
> request handler in Solr, and potentially to use the SolrJ client to publish 
> the edges to each node in the cluster.
> There are a few outstanding design questions, first being, how do we know 
> what the list of shards are that are participating in the current query 
> request?  Is that easy info to get at?
> Second,  currently, we are serializing a query object between the shards, 
> perhaps we should consider a slightly different abstraction, and serialize 
> lists of "edge" objects between the nodes.   The point of this would be to 
> batch the exploration/traversal of current frontier to help avoid large 
> bursts of memory being required.
> Third, what sort of caching strategy should be introduced for the frontier 
> queries, if any?  And if we do some caching there, how/when should the 
> entries be expired and auto-warmed.
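
To make the traversal loop described above concrete, here is a rough sketch; 
the helper methods are placeholders for the shard-local query and broadcast 
machinery (e.g. the proposed FrontierBroker), not actual Solr APIs:

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: the traversal loop as run on a single shard.
abstract class FrontierTraversalSketch {
  abstract Set<String> runRootQuery();                 // where to start on this shard
  abstract Set<String> discoverEdges(Set<String> frontier);
  abstract void broadcastToShards(Set<String> edges);  // publish to all participating shards
  abstract Set<String> receiveFrontierParts();         // edges gathered from every shard

  void traverse(int maxDepth) {
    Set<String> frontier = runRootQuery();
    for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
      broadcastToShards(discoverEdges(frontier));
      // Assemble the frontier query from all shards' edges and execute it
      // locally to find the node documents for the next hop.
      frontier = new HashSet<>(receiveFrontierParts());
    }
  }
}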






[jira] [Commented] (SOLR-12328) Adding graph json facet domain change

2018-05-15 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475938#comment-16475938
 ] 

Kevin Watters commented on SOLR-12328:
--

Hey Dan, this looks pretty awesome.  One comment: if the traversal filter is 
null/empty, I don't think the default match-all query is needed.  So, in the 
GraphField class, I think you can probably get rid of that null check and the 
default value for the traversal filter.

 

 

> Adding graph json facet domain change
> -
>
> Key: SOLR-12328
> URL: https://issues.apache.org/jira/browse/SOLR-12328
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.3
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: SOLR-12328.patch
>
>
> Json facets now support join queries via domain change. I've made a 
> relatively small enhancement to add graph to the mix. I'll attach a patch for 
> your viewing. I'm hoping this can be merged into solr proper. Please let me 
> know if there are any problems/changes/requirements. 






[jira] [Issue Comment Deleted] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models

2018-02-17 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-11838:
-
Comment: was deleted


> explore supporting Deeplearning4j NeuralNetwork models
> --
>
> Key: SOLR-11838
> URL: https://issues.apache.org/jira/browse/SOLR-11838
> Project: Solr
>  Issue Type: New Feature
>Reporter: Christine Poerschke
>Priority: Major
> Attachments: SOLR-11838.patch, SOLR-11838.patch
>
>
> [~yuyano] wrote in SOLR-11597:
> bq. ... If we think to apply this to more complex neural networks in the 
> future, we will need to support layers ...
> [~malcorn_redhat] wrote in SOLR-11597:
> bq. ... In my opinion, if this is a route Solr eventually wants to go, I 
> think a better strategy would be to just add a dependency on 
> [Deeplearning4j|https://deeplearning4j.org/] ...
> Creating this ticket for the idea to be explored further (if anyone is 
> interested in exploring it), complementary to and independent of the 
> SOLR-11597 RankNet related effort.






[jira] [Commented] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models

2018-02-17 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368294#comment-16368294
 ] 

Kevin Watters commented on SOLR-11838:
--

One small item that I'm coming across here: it would seem that Solr is 
currently using Guava 14.0.  DL4j depends on Guava 20.0.  This dependency will 
break SolrJ if we integrate DL4j into Solr, due to deprecated methods in 
Guava 14.

Thoughts?  Maybe we should update Solr to a newer version of Guava?  (I'm 
going through the same integration with MyRobotLab now, except I'm using an 
EmbeddedSolrServer at the moment.)







[jira] [Commented] (SOLR-11838) explore supporting Deeplearning4j NeuralNetwork models

2018-01-29 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343602#comment-16343602
 ] 

Kevin Watters commented on SOLR-11838:
--

I'm very excited to see this integration happening.  [~gus_heck] has been 
working with me on some DL4j projects, in particular training models and 
evaluating them for classification.  I think at a high level there are 3 main 
integration patterns that we could / should consider in Solr.
 # using a model at ingest time to tag / annotate a record going into the 
index (a primary example would be something like sentiment analysis tagging).  
This implies the model was trained and saved somewhere.
 # using a Solr index (query) to generate a set of training/test data so that 
DL4j can "fit" the model and train it (there might even be a desire for some 
join functionality so you can join together two datasets to create ad hoc 
training datasets).
 # (this is a bit more out there.)  indexing each node of the multi layer 
network / computation graph as a document in the index, and use a query to 
evaluate the output of the model by traversing the documents in the index to 
ultimately come up with a set of relevancy scores for the documents that 
represent the output layer of the network.

I think, perhaps, the most interesting use case is #2.  So basically, the idea 
is you want to define a network (specify the layers, types of layers, 
activation function, etc.) and then specify a query that matches a set of 
documents in the index that would be used to train that model.  Currently DL4j 
uses "datavec" to handle all the data normalization prior to going into the 
model for training.  That exposes a DataSetIterator.  The DataSetIterator could 
be replaced with an iterator that sits on top of the export handler or even 
just a raw search result (a rough sketch of this follows below).  The general 
use cases here for pagination would be 
 # to iterate the full result set  (presumably multiple times as the model will 
make multiple passes over the data when training.)
 # generate a random ordering of the dataset being returned
 # excluding a random (but deterministic?) set of documents from the main query 
to provide a holdout testing dataset.

Keeping in mind that typically in network training, you have both your training 
dataset and the testing dataset.  

The final outcome of this would be a ComputationGraph/MultiLayerNetwork, which 
can be serialized by DL4j as a JSON file; the other output could/should be 
the evaluation or accuracy scores of the model (F1, accuracy, and confusion 
matrix).
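
As a rough sketch of the iterator idea in use case #2, using plain cursorMark 
paging rather than the export handler (the class name, collection name, sort 
field, and batch size are invented for the example):

import java.util.Iterator;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CursorMarkParams;

// Sketch: stream training documents out of a Solr collection one page at a
// time, the way a DataSetIterator replacement would need to.
class SolrBatchIterator implements Iterator<SolrDocumentList> {
  private final SolrClient client;
  private final SolrQuery query;
  private String cursorMark = CursorMarkParams.CURSOR_MARK_START;
  private boolean done = false;

  SolrBatchIterator(SolrClient client, String queryString) {
    this.client = client;
    this.query = new SolrQuery(queryString);
    this.query.setRows(512);                        // batch size per page
    this.query.setSort("id", SolrQuery.ORDER.asc);  // cursorMark needs a stable sort
  }

  @Override public boolean hasNext() { return !done; }

  @Override public SolrDocumentList next() {
    try {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query("training", query);
      String next = rsp.getNextCursorMark();
      done = next.equals(cursorMark);  // the cursor stops advancing at the end
      cursorMark = next;
      return rsp.getResults();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}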

As per the comments about natives: yes, there are definitely platform-dependent 
parts of DL4j, in particular "nd4j", which can be GPU/CPU, but there are 
also other dependencies on javacv/javacpp.  The javacv/javacpp stuff is really 
only used for image manipulation, as it's the Java binding to OpenCV.  The 
dependency tree for DL4j is rather large, so I think we'll need to take 
care/caution that we're not injecting a bunch of conflicting jar files.  
Perhaps we should start by identifying the conflicting jar versions.

 







[jira] [Updated] (SOLR-11384) add support for distributed graph query

2017-10-26 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-11384:
-
Issue Type: Improvement  (was: Bug)







[jira] [Updated] (SOLR-11384) add support for distributed graph query

2017-09-21 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-11384:
-
Priority: Minor  (was: Major)







[jira] [Created] (SOLR-11384) add support for distributed graph query

2017-09-21 Thread Kevin Watters (JIRA)
Kevin Watters created SOLR-11384:


 Summary: add support for distributed graph query
 Key: SOLR-11384
 URL: https://issues.apache.org/jira/browse/SOLR-11384
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Kevin Watters





[jira] [Comment Edited] (SOLR-9415) graph search filter edge

2016-08-19 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428480#comment-15428480
 ] 

Kevin Watters edited comment on SOLR-9415 at 8/19/16 5:20 PM:
--

Hello cmd,  
Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with 
the query "relationship:College"  ... 

should be something like:

 {!graph from="name1" to="name2" 
traversalFilter="+relationship:College_school_classmate +time:[2015-01-01 TO 
2016-01-01]"}name1:tom

-Kevin



was (Author: kwatters):
Hello cmd,  
Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with 
the query "relationship:College"  ... 

should be something like:

 {!graph from="name1" to="name2" 
traversalFilter="relationship:College_school_classmate"}name1:tom

-Kevin


> graph search filter edge
> 
>
> Key: SOLR-9415
> URL: https://issues.apache.org/jira/browse/SOLR-9415
> Project: Solr
>  Issue Type: Wish
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.1
>Reporter: cmd
> Fix For: 6.x
>
>
> currently the Solr graph query has no edge concept. For example:
> name1(node),name2(node),relationtype,time,other edge attr.
> tom,alice,College_school_classmate,2016-10-01 
> tom,alice,High_school_classmate,2013-10-01
> tom,alice,middle_school_classmate,2009-10-01
> tom,alice,Primary_school_classmate,2005-10-01 
> tom,Smith,College_school_classmate,2016-10-01 
> tom,Smith,High_school_classmate,2013-10-01
> tom,Smith,middle_school_classmate,2009-10-01
> tom,Smith,Primary_school_classmate,2005-10-01 
> node
> tom  age:23 sex:male addr:
> Smith age:25 sex...
> alice   .
> I want to filter: for tom, with time:[2009 TO 2013] and addr: and 
> relationtype=College, who matches?
> refer: http://graphml.graphdrawing.org/primer/graphml-primer.html






[jira] [Commented] (SOLR-9415) graph search filter edge

2016-08-19 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428480#comment-15428480
 ] 

Kevin Watters commented on SOLR-9415:
-

Hello cmd,  
Are you using the GraphQueryParser? If so, you can add a "traversalFilter" with 
the query "relationship:College"  ... 

should be something like:

 {!graph from="name1" to="name2" 
traversalFilter="relationship:College_school_classmate"}name1:tom

-Kevin








[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes

2016-04-27 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261408#comment-15261408
 ] 

Kevin Watters commented on SOLR-9027:
-

No specific use case, but if doc frequency is 0 for a given term in a 
"node/from" field, there's not much point in traversing it, or querying for it 
in the first place.  I'm not sure if that's even possible, but you might have 
edges that point to a document that doesn't exist; in such a case, it's an easy 
optimization to avoid that traversal (similar to the leafNodesOnly parameter 
on the GraphQuery).


> Add GraphTermsQuery to limit traversal on high frequency nodes
> --
>
> Key: SOLR-9027
> URL: https://issues.apache.org/jira/browse/SOLR-9027
> Project: Solr
>  Issue Type: New Feature
>Reporter: Joel Bernstein
>Priority: Minor
> Attachments: SOLR-9027.patch, SOLR-9027.patch, SOLR-9027.patch, 
> SOLR-9027.patch
>
>
> The gatherNodes() Streaming Expression is currently using a basic disjunction 
> query to perform the traversals. This ticket is to create a specific 
> GraphTermsQuery for performing the traversals. 
> The GraphTermsQuery will be based off of the TermsQuery, but will also 
> include an option for a docFreq cutoff. Terms that are above the docFreq 
> cutoff will not be included in the query. This will help users do a more 
> precise and efficient traversal.






[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes

2016-04-27 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261311#comment-15261311
 ] 

Kevin Watters commented on SOLR-9027:
-

Yes, sorry for the typo, minDocFreq :)  Avoiding sparse edges could be a useful 
use case (especially in a distributed setting).







[jira] [Commented] (SOLR-9027) Add GraphTermsQuery to limit traversal on high frequency nodes

2016-04-27 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261007#comment-15261007
 ] 

Kevin Watters commented on SOLR-9027:
-

Nice stuff, Joel!  Just a thought: it might be nice to also provide a 
"maxDocFreq" on the GraphTermsQuery.  It would be relatively easy to add now, 
and it would allow graph traversals that ignore sparse edges.

Either way, this is very cool.  It seems like this would be a natural 
enhancement for the GraphQuery when it builds the frontier.







[jira] [Comment Edited] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-29 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212525#comment-15212525
 ] 

Kevin Watters edited comment on SOLR-8176 at 3/29/16 2:19 PM:
--

Here's a patch with a basic implementation of a Kafka-based frontier query 
broker to support distributed graph query traversal in Solr.  The unit test is 
commented out because it requires a Kafka broker to be running.  Also, there 
are some config parameters / properties that are hard-coded.  Either way, this 
shows how to use the GraphQuery in a distributed graph traversal mode.

Disclaimer: this patch isn't intended to be merged; it's really only an 
example of how to do it.  There's a lot of cleanup that still needs to happen 
to make it ready for prime time.


was (Author: kwatters):
Here's a patch with a basic implementation of a Kafka-based frontier query 
broker to support distributed graph query traversal in Solr.  The unit test is 
commented out because it requires a running Kafka broker.  Also, some config 
parameters / properties are hard coded.  Either way, this shows how to use the 
GraphQuery in a distributed graph traversal mode. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
> Attachments: SOLR-8176.patch
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-25 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-8176:

Attachment: SOLR-8176.patch

Here's a patch with a basic implementation of a Kafka-based frontier query 
broker to support distributed graph query traversal in Solr.  The unit test is 
commented out because it requires a running Kafka broker.  Also, some config 
parameters / properties are hard coded.  Either way, this shows how to use the 
GraphQuery in a distributed graph traversal mode. 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
> Attachments: SOLR-8176.patch
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-18 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197814#comment-15197814
 ] 

Kevin Watters commented on SOLR-8176:
-

Hi Gopal,
  I'm running a little bit behind the times; I'm still working off a branch 
that was checked out from SVN.  I'll update to trunk from git, make sure my 
local tests are still passing, and post a patch once I've cleaned up my 
comments and code a little bit.

Joel, 
  Thanks for the pointer, I'll have a look at the TopicStream.  It might do 
what we need; if not, maybe we can extend it.  I've been focusing on Kafka 
because it's pretty simple, generic, robust, and scales really well.  I'm not 
tied to any particular technology for it, so long as we can publish some 
objects with a unique topic identifier.



> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8176) Model distributed graph traversals with Streaming Expressions

2016-03-12 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192120#comment-15192120
 ] 

Kevin Watters commented on SOLR-8176:
-

Hey guys, I know you're really focusing on streaming expressions for graph 
traversal; I just wanted to throw this out there.  I have a version of it 
working based on the GraphQuery.  It's completely distributed; the only kicker 
is that I implemented it with a dependency on Kafka as a message broker to 
handle the shuffling of the frontier query.  I was curious whether there's a 
message broker already in the Solr stack; if so, it should be reasonably easy 
to swap out the Kafka dependency, and then we'll all have a fully distributed 
graph traversal in Solr.  Let me know what you think, 

> Model distributed graph traversals with Streaming Expressions
> -
>
> Key: SOLR-8176
> URL: https://issues.apache.org/jira/browse/SOLR-8176
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java, SolrCloud, SolrJ
>Affects Versions: master
>Reporter: Joel Bernstein
>  Labels: Graph
> Fix For: master
>
>
> I think it would be useful to model a few *distributed graph traversal* use 
> cases with Solr's *Streaming Expression* language. This ticket will explore 
> different approaches with a goal of implementing two or three common graph 
> traversal use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-8532) Optimize GraphQuery when maxDepth is set

2016-01-20 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-8532:

Attachment: SOLR-8532.patch

This patch contains the optimizations for the graph query.

> Optimize GraphQuery when maxDepth is set
> 
>
> Key: SOLR-8532
> URL: https://issues.apache.org/jira/browse/SOLR-8532
> Project: Solr
>  Issue Type: Bug
>Reporter: Kevin Watters
> Attachments: SOLR-8532.patch
>
>
> The current graph query implementation always collects edges.  When a 
> maxDepth is specified, there is an obvious optimization: don't collect edges 
> at the maxDepth level.  
> In addition, there are some other memory optimizations that I'd like to merge 
> in.  I have an updated version that includes the above optimization; in 
> addition, there are some memory optimizations that can be applied if 
> returnRoot != false.  In that case, it doesn't need to hold on to the 
> original docset that matched the root nodes of the query.  
> I will be posting the patch in the next few days. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-8532) Optimize GraphQuery when maxDepth is set

2016-01-10 Thread Kevin Watters (JIRA)
Kevin Watters created SOLR-8532:
---

 Summary: Optimize GraphQuery when maxDepth is set
 Key: SOLR-8532
 URL: https://issues.apache.org/jira/browse/SOLR-8532
 Project: Solr
  Issue Type: Bug
Reporter: Kevin Watters


The current graph query implementation always collects edges.  When a maxDepth 
is specified, there is an obvious optimization: don't collect edges at the 
maxDepth level.  

In addition, there are some other memory optimizations that I'd like to merge 
in.  I have an updated version that includes the above optimization; in 
addition, there are some memory optimizations that can be applied if returnRoot 
!= false.  In that case, it doesn't need to hold on to the original docset that 
matched the root nodes of the query.  

I will be posting the patch in the next few days. 
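
As a hedged sketch of the first optimization, with invented class and helper 
names: the hit collector can mark documents at the deepest hop as matches 
without gathering their outgoing edges, since those edges would never be 
expanded anyway.

import java.io.IOException;
import org.apache.lucene.util.FixedBitSet;

public class DepthAwareCollector {
  private final FixedBitSet resultBits;   // docs matched by the traversal so far
  private final int maxDepth;             // -1 means unlimited
  private int currentDepth;               // hop number of the frontier being collected

  public DepthAwareCollector(FixedBitSet resultBits, int maxDepth) {
    this.resultBits = resultBits;
    this.maxDepth = maxDepth;
  }

  public void nextHop() {
    currentDepth++;                       // advanced by the traversal loop after each hop
  }

  public void collect(int doc) throws IOException {
    resultBits.set(doc);                  // the doc itself is part of the graph result
    if (maxDepth < 0 || currentDepth < maxDepth) {
      collectEdgeTerms(doc);              // hypothetical helper that reads the edge field
    }                                     // at maxDepth: match the doc, skip its edges
  }

  private void collectEdgeTerms(int doc) throws IOException {
    // omitted: gather this doc's edge terms for the next frontier query
  }
}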



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-10-06 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945822#comment-14945822
 ] 

Kevin Watters commented on SOLR-7543:
-

Nice improvements!  The new TermsQuery definitely is a nice fit for this type 
of query.  (Though that code path is only active if useAutn=false, so it 
doesn't do the automaton compilation.)
Looks good to me, let's roll with it!
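
For reference, a hedged example of flipping that switch on the graph parser; 
the field names here are hypothetical:

q={!graph from=node_id to=edge_ids useAutn=false}id:doc_8

With useAutn=false, the gathered terms go through the TermsQuery code path 
instead of being compiled into an automaton.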

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
> Attachments: SOLR-7543.patch, SOLR-7543.patch
>
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.
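
To make the traversal loop concrete, here is a minimal standalone sketch of 
the BFS described above, using an in-memory adjacency map as a stand-in for 
the fromField/toField postings the real query walks; all names here are 
invented.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GraphBfsSketch {
  // BFS mirroring the described algorithm: start from the seed set, expand
  // hop by hop until no new edges appear or maxDepth is reached, and detect
  // cycles by never revisiting a document for edge extraction.
  public static Set<String> traverse(Map<String, List<String>> edges,
      Set<String> seeds, int maxDepth) {  // maxDepth < 0 means unlimited
    Set<String> visited = new HashSet<>(seeds);
    Set<String> frontier = new HashSet<>(seeds);
    for (int depth = 0; (maxDepth < 0 || depth < maxDepth) && !frontier.isEmpty(); depth++) {
      Set<String> next = new HashSet<>();
      for (String node : frontier) {
        for (String target : edges.getOrDefault(node, List.of())) {
          if (visited.add(target)) {   // unseen node: part of the next frontier
            next.add(target);
          }
        }
      }
      frontier = next;                 // "re-execute" with the newly found edges
    }
    return visited;                    // every document the traversal arrived at
  }
}

The returnStartNodes / onlyLeafNodes options then amount to post-filters over 
this visited set.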



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-10-05 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-7543:

Attachment: SOLR-7543.patch

Patch with GraphQuery / parsers / unit tests.

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
> Attachments: SOLR-7543.patch
>
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-19 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550544#comment-14550544
 ] 

Kevin Watters commented on SOLR-7543:
-

[~ysee...@gmail.com], right now it builds against 4.x.  If I submit a patch, 
should it be done for trunk, or is a 4.x branch ok?  I'm just finishing up the 
unit tests; either way, I hope to have a patch submitted by the end of the week.

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-18 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549615#comment-14549615
 ] 

Kevin Watters commented on SOLR-7543:
-

Hi [~steff1193], I agree we want to do all sorts of great types of graph 
queries.  The problem is, as soon as you take the step to maintain metadata 
about the graph traversal, the memory requirements for such an operation can 
be huge.

The way I see it, there are likely 3 things to do to close the gap:
* Make the traversalFilter a more complex data structure (like an array), to 
allow different filters at different graph traversal levels.
* Accumulate a weight field on the traversed edges as part of the relevancy 
score (currently no ranking is done).
* Maintain the history of edges that traverse into a node.

All of these could be considered for future functionality, but it would really 
take some re-thinking of how it all works.  For now, having the functionality 
to apply the graph as a filter to the result set is the goal.

In many cases, if you nest these graph queries and the documents are 
structured properly, you should still be able to achieve the result that you 
desire, but we'd have to take that on a case-by-case basis.
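
A hedged sketch of that nesting, with invented field names, using parameter 
dereferencing so the inner graph query seeds the outer one:

q={!graph from=dept_id to=id v=$seed}&seed={!graph from=manager_id to=id}id:doc_1

The inner traversal's matches become the start set of the outer traversal; 
each level stays a plain filter, which is what keeps the memory footprint at 
one bitset per query.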

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-18 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549652#comment-14549652
 ] 

Kevin Watters commented on SOLR-7543:
-

Ok, my initial graph query parser is handling the following syntax:

{!graph from=node_field to=edge_field returnRoot=true 
returnOnlyLeaf=false maxDepth=-1 traversalFilter=foo:bar}id:doc_8

The above would start traversal at doc_8 and only walk nodes that have a field 
"foo" containing the value "bar".  This seems to be (more) consistent with the 
rest of the query parsers.
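
For example, used as a filter query on a normal request (host, collection, and 
fields are hypothetical, and the local params would need URL encoding in 
practice):

http://localhost:8983/solr/collection1/select?q=*:*&fq={!graph from=node_field to=edge_field maxDepth=2}id:doc_8

This restricts the main result set to documents reachable within two hops of 
doc_8.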

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-15 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545546#comment-14545546
 ] 

Kevin Watters commented on SOLR-7543:
-

Interesting, Dennis.  I wasn't aware of SOLR-7377; I'll have to take a bit 
more time to understand what that means in the context of the graph query.  
I'm not sure how cross-collection graph traversal will play with my 
implementation.  The issue is that my lucene graph query is currently local to 
a single shard/core.  I have been chatting with [~joel.bernstein] about the 
distributed graph traversal use case, and I think there is a play for 
streaming aggregation there.  There is one line that needs to be 
coordinated/synchronized across the cluster to do the distributed graph 
traversal; I think that's where the streaming stuff comes in.  
I like the idea of renaming "returnStartNodes" to "returnRoot"... fewer words 
and hopefully more descriptive of what is happening (same for 
"returnOnlyLeaf").  Maybe the word "nodes" is redundant, and it obscures that 
it's really just a document at the end of the day.

> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-15 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546031#comment-14546031
 ] 

Kevin Watters commented on SOLR-7543:
-

[~steff1193]  My mantra here relates to something I heard once: "A graph is a 
filter on top of your data."  So, I'm offering this implementation up to solve 
that use case.  Analytics on top of that graph would be achieved via faceting 
or streaming aggregation.  Maybe there's something that Titan could leverage 
from this implementation?  There are some starting plans on doing a 
distributed version of this query operator.  

[~dpgove] Interesting syntax.  The use case of "children > 4" isn't currently 
supported in my impl.  My impl doesn't have any history of the paths through 
the graph.  It only has the bitset that represents the matched documents.  I 
wanted to keep it as lean as possible.  We could start keeping around 
additional data structures during the traversal to count, but that can get 
very expensive very quickly.  My goal/desire here is to keep the memory usage 
to one bitset.


> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-14 Thread Kevin Watters (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Watters updated SOLR-7543:

Description: 
I have a GraphQuery that I implemented a long time back that allows a user to 
specify a "startQuery" to identify which documents to start graph traversal 
from.  It then gathers up the edge ids for those documents, optionally applies 
an additional filter.  The query is then re-executed continually until no new 
edge ids are identified.  I am currently hosting this code up at 
https://github.com/kwatters/solrgraph and I would like to work with the 
community to get some feedback and ultimately get it committed back in as a 
lucene query.

Here's a bit more of a description of the parameters for the query / graph 
traversal:

q - the initial start query that identifies the universe of documents to start 
traversal from.
fromField - the field name that contains the node id
toField - the name of the field that contains the edge id(s).
traversalFilter - this is an additional query that can be supplied to limit the 
scope of graph traversal to just the edges that satisfy the traversalFilter 
query.
maxDepth - integer specifying how deep the breadth first search should go.
returnStartNodes - boolean to determine if the documents that matched the 
original "q" should be returned as part of the graph.
onlyLeafNodes - boolean that filters the graph query to only return 
documents/nodes that have no edges.

We identify a set of documents with "q" as any arbitrary lucene query.  It will 
collect the values in the fromField, create an OR query with those values, 
optionally apply an additional constraint from the "traversalFilter" and walk 
the result set until no new edges are detected.  Traversal can also be stopped 
at N hops away as defined with the maxDepth.  This is a BFS (Breadth First 
Search) algorithm.  Cycle detection is done by not revisiting the same document 
for edge extraction.  

This query operator does not keep track of how you arrived at the document, but 
only that the traversal did arrive at the document.

  was:I have a GraphQuery that I implemented a long time back that allows a 
user to specify a "seedQuery" to identify which documents to start graph 
traversal from.  It then gathers up the edge ids for those documents, 
optionally applies an additional filter.  The query is then re-executed 
continually until no new edge ids are identified.  I am currently hosting this 
code up at https://github.com/kwatters/solrgraph and I would like to work with 
the community to get some feedback and ultimately get it committed back in as a 
lucene query.


> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.

[jira] [Commented] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-14 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543982#comment-14543982
 ] 

Kevin Watters commented on SOLR-7543:
-

Hi Yonik, thanks for chiming in!  Yup, you can think of this as a multi-step 
join.  In fact, I use the graph operator with a maxDepth of 1 to implement an 
inner join.
I like things to be consistent (it's easier for others to grok that way); we 
can rename the fromField and the toField to be "from" and "to".  When it comes 
to the GraphQueryParser(Plugin), I'm open to whatever the community likes and 
whatever is consistent with the other parsers out there. 
(I've always been a bit thrown by the {!parser_name} syntax, which is why I 
also have a client-side object model so that I can programmatically build up 
an expression and serialize it over to my custom parser, which deserializes it 
and converts it into the appropriate lucene query objects.)  I suppose I just 
want to make sure that the "v=my_start_query" can be any arbitrary lucene 
query.   
I also still need to work up some richer examples and test cases as part of 
this ticket.
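
A hedged illustration of that single-hop join, with hypothetical field names:

fq={!graph from=node_field to=edge_field maxDepth=1 returnRoot=false}user:customer1

One hop from the seed set, with the seeds themselves excluded, behaves like an 
inner join through the edge field: you get the joined-to documents without 
knowing which seed they joined from.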


> Create GraphQuery that allows graph traversal as a query operator.
> --
>
> Key: SOLR-7543
> URL: https://issues.apache.org/jira/browse/SOLR-7543
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Kevin Watters
>Priority: Minor
>
> I have a GraphQuery that I implemented a long time back that allows a user to 
> specify a "startQuery" to identify which documents to start graph traversal 
> from.  It then gathers up the edge ids for those documents, optionally 
> applies an additional filter.  The query is then re-executed continually 
> until no new edge ids are identified.  I am currently hosting this code up at 
> https://github.com/kwatters/solrgraph and I would like to work with the 
> community to get some feedback and ultimately get it committed back in as a 
> lucene query.
> Here's a bit more of a description of the parameters for the query / graph 
> traversal:
> q - the initial start query that identifies the universe of documents to 
> start traversal from.
> fromField - the field name that contains the node id
> toField - the name of the field that contains the edge id(s).
> traversalFilter - this is an additional query that can be supplied to limit 
> the scope of graph traversal to just the edges that satisfy the 
> traversalFilter query.
> maxDepth - integer specifying how deep the breadth first search should go.
> returnStartNodes - boolean to determine if the documents that matched the 
> original "q" should be returned as part of the graph.
> onlyLeafNodes - boolean that filters the graph query to only return 
> documents/nodes that have no edges.
> We identify a set of documents with "q" as any arbitrary lucene query.  It 
> will collect the values in the fromField, create an OR query with those 
> values, optionally apply an additional constraint from the "traversalFilter" 
> and walk the result set until no new edges are detected.  Traversal can also 
> be stopped at N hops away as defined with the maxDepth.  This is a BFS 
> (Breadth First Search) algorithm.  Cycle detection is done by not revisiting 
> the same document for edge extraction.  
> This query operator does not keep track of how you arrived at the document, 
> but only that the traversal did arrive at the document.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-7543) Create GraphQuery that allows graph traversal as a query operator.

2015-05-13 Thread Kevin Watters (JIRA)
Kevin Watters created SOLR-7543:
---

 Summary: Create GraphQuery that allows graph traversal as a query 
operator.
 Key: SOLR-7543
 URL: https://issues.apache.org/jira/browse/SOLR-7543
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Kevin Watters
Priority: Minor


I have a GraphQuery that I implemented a long time back that allows a user to 
specify a "seedQuery" to identify which documents to start graph traversal 
from.  It then gathers up the edge ids for those documents, optionally applies 
an additional filter.  The query is then re-executed continually until no new 
edge ids are identified.  I am currently hosting this code up at 
https://github.com/kwatters/solrgraph and I would like to work with the 
community to get some feedback and ultimately get it committed back in as a 
lucene query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4787) Join Contrib

2013-05-09 Thread Kevin Watters (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652997#comment-13652997
 ] 

Kevin Watters commented on SOLR-4787:
-

Hey Joel, 
  It was good to meet you at the conference last week.  We talked a little bit 
about my GraphQuery operator.  The use case of a 1-level graph traversal can 
accomplish a post-filter join request.  The caveat is that you won't know which 
record was joined to, only that it did satisfy the join requirement.  I could 
contribute it here, or perhaps we could create a Graph Contrib ticket?
Thanks,
  -Kevin

> Join Contrib
> 
>
> Key: SOLR-4787
> URL: https://issues.apache.org/jira/browse/SOLR-4787
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 4.2.1
>Reporter: Joel Bernstein
>Priority: Minor
> Fix For: 4.2.1
>
> Attachments: SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch
>
>
> This contrib provides a place where different join implementations can be 
> contributed to Solr. This contrib currently includes 2 join implementations. 
> The initial patch was generated from the Solr 4.2.1 tag. Because of changes 
> in the FieldCache API this patch will only build with Solr 4.2 or above.
> *PostFilterJoinQParserPlugin aka "pjoin"*
> The pjoin provides a join implementation that filters results in one core 
> based on the results of a search in another core. This is similar in 
> functionality to the JoinQParserPlugin but the implementation differs in a 
> couple of important ways.
> The first way is that the pjoin is designed to work with integer join keys 
> only. So, in order to use pjoin, integer join keys must be included in both 
> the to and from core.
> The second difference is that the pjoin builds memory structures that are 
> used to quickly connect the join keys. It also uses a custom SolrCache named 
> "join" to hold intermediate DocSets which are needed to build the join memory 
> structures. So, the pjoin will need more memory than the JoinQParserPlugin to 
> perform the join.
> The main advantage of the pjoin is that it can scale to join millions of keys 
> between cores.
> Because it's a PostFilter, it only needs to join records that match the main 
> query.
> The syntax of the pjoin is the same as the JoinQParserPlugin except that the 
> plugin is referenced by the string "pjoin" rather than "join".
> fq={!pjoin fromCore=collection2 from=id_i to=id_i}user:customer1
> The example filter query above will search the fromCore (collection2) for 
> user:customer1. This query will generate a list of values from the "from" 
> field that will be used to filter the main query. Only records from the main 
> query, where the "to" field is present in the "from" list, will be included 
> in the results.
> The solrconfig.xml in the main query core must contain the reference to the 
> pjoin.
> <queryParser name="pjoin" 
> class="org.apache.solr.joins.PostFilterJoinQParserPlugin"/>
> And the join contrib jars must be registered in the solrconfig.xml.
> <lib dir="../../../dist/" regex="solr-joins-\d.*\.jar" />
> The solrconfig.xml in the fromCore must have the "join" SolrCache configured.
> <cache name="join"
>   class="solr.LRUCache"
>   size="4096"
>   initialSize="1024"
>   />
> *JoinValueSourceParserPlugin aka "vjoin"*
> The second implementation is the JoinValueSourceParserPlugin aka "vjoin". 
> This implements a ValueSource function query that can return values from a 
> second core based on join keys. This allows relevance data to be stored in a 
> separate core and then joined in the main query.
> The vjoin is called using the vjoin function query. For example:
> bf=vjoin(fromCore, fromKey, fromVal, toKey)
> This example shows vjoin being called by the edismax boost function 
> parameter. This example will return the fromVal from the fromCore. The 
> fromKey and toKey are used to link the records from the main query to the 
> records in the fromCore.
> As with the pjoin, both the fromKey and toKey must be integers. Also like 
> the pjoin, the "join" SolrCache is used to hold the join memory structures.
> To configure the vjoin you must register the ValueSource plugin in the 
> solrconfig.xml as follows:
> <valueSourceParser name="vjoin" 
> class="org.apache.solr.joins.JoinValueSourceParserPlugin" />

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org