Re: Elevation in dataDir in Solr Cloud

2021-02-16 Thread Chris Hostetter


: Of course, here is the full stack trace (collection 'techproducts' with
: just one core to make it easier):

Ah yeah ... see -- this looks like a mistake introduced at some point...

: Caused by: org.apache.solr.core.SolrResourceNotFoundException: Can't
: find resource 'elevate.xml' in classpath or
: '/configs/techproductsConfExp', cwd=/usr/share/solr-7.7.2/server
:   at 
org.apache.solr.cloud.ZkSolrResourceLoader.openResource(ZkSolrResourceLoader.java:130)
:   at 
org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:362)
:   at org.apache.solr.core.Config.<init>(Config.java:120)
:   at org.apache.solr.core.Config.<init>(Config.java:90)
:   at 
org.apache.solr.handler.component.QueryElevationComponent.loadElevationProvider(QueryElevationComponent.java:366)

...this bit of code is *expecting* to be able to init a Config object from 
the SolrResourceLoader, even though this bit of code...

:   at 
org.apache.solr.handler.component.QueryElevationComponent.getElevationProvider(QueryElevationComponent.java:321)
:   at 
org.apache.solr.handler.component.QueryElevationComponent.loadElevationConfiguration(QueryElevationComponent.java:259)

...has already established that there is no "Config" file available from 
the resource loader, and we should be initializing an ElevationProvider 
that can read from the data dir.  (And this code seems to be unchanged on 
branch_8x)

Can you please file a Jira pointing out that this doesn't work, along with 
the full stack trace, and then add a comment copy/pasting my comments here 
about how the code makes no sense?


I'm not sure if/when someone who understands the code well enough will be 
able to help fix this (and write a test for it) ... was the experiment / 
workaround I suggested viable? ...


: > I don't know if it will work, but one thing you might want to experiment
: > with is putting your elevate.xml back the configset in zk, and updating it
: > on the fly in zk -- then see if it gets reloaded by each core the next
: > time the index changes (NOTE that there will almost certainly need to be
: > an index change for it to re-load, since I don't see any indication that
: > it's watching for changes in zk)
: >
: > FWIW: the way most people seem to be using QEC these days is to have an
: > empty elevate.xml file, and then have their application use some other
: > key/val store, or more complex matching logic, to decide which documents
: > to elevate, and then use the "elevateIds" param to pass that info to solr.


-Hoss
http://www.lucidworks.com/


Re: Elevation in dataDir in Solr Cloud

2021-02-12 Thread Chris Hostetter


: I need to have the elevate.xml file updated frequently and I was wondering
: if it is possible to put this file in the dataDir folder when using Solr
: Cloud. I know that this is possible in the standalone mode, and I haven't
: seen in the documentation [1] that it can not be done in Cloud.
: 
: I am using Solr 7.7.2 and ZooKeeper. After creating the techproducts
: collection for the tests, I remove the elevate.xml file from the
: configuration and I put it in the dataDir folder of the cores. When I
: update the collection with that configuration, I get the following error:
: "Can't find resource 'elevate.xml' in classpath or
: '/configs/techproductsConfExp'". Is this expected or I am doing something
: wrong?

Hmmm... can you share the full stack trace of that error?

(I suspect at some point someone made a sloppy assumption in the QEC code 
that no one would ever try to keep elevate.xml in the data dir in cloud 
mode.)

I don't know if it will work, but one thing you might want to experiment 
with is putting your elevate.xml back in the configset in zk, and updating it 
on the fly in zk -- then see if it gets reloaded by each core the next 
time the index changes (NOTE that there will almost certainly need to be 
an index change for it to re-load, since I don't see any indication that 
it's watching for changes in zk)
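
If you try that, pushing the updated file into the configset from the command 
line should look something like this (untested; adjust the config name and ZK 
connect string -- 9983 is the embedded ZK port when running the cloud example):

  $ bin/solr zk cp /local/path/elevate.xml zk:/configs/techproductsConfExp/elevate.xml -z localhost:9983

(An explicit collection RELOAD should also force every core to re-read the 
configset, if the "wait for an index change" part doesn't pan out.)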

FWIW: the way most people seem to be using QEC these days is to have an 
empty elevate.xml file, and then have their application use some other 
key/val store, or more complex matching logic, to decide which documents 
to elevate, and then use the "elevateIds" param to pass that info to solr.
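
For reference, that approach looks something like this against the 
techproducts example (the doc ids below are just examples; the elevator 
component has to be wired into the request handler, which the techproducts 
config already does for /elevate):

  $ curl 'http://localhost:8983/solr/techproducts/elevate?q=ipod&enableElevation=true&elevateIds=IW-02,F8V7067-APL-KIT'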


-Hoss
http://www.lucidworks.com/


RE: Ghost Documents or Shards out of Sync

2021-02-09 Thread Chris Hostetter


: Let me add some background. A user triggers an operation which under the 
: hood needs to update a single field. Atomic update fails with a message 
: that one of the mandatory fields is missing (which is strange by 
: itself). When I query Solr for the exact document (fq with the document 
: id) I sometimes get the expected single result and sometimes zero. Those 
: queries are done sometimes couple of days later so auto commits 
: necessarily have been performed.

I suspect what you are seeing is that the update succeeds on a leader, but 
for some reason (I'm not really understanding your description of the 
atomic update failure) it fails on a replica -- leaving them in an 
inconsistent state.  Restarting all the nodes forces the out-of-sync 
replica to recover.

If I'm correct, then when you see these inconsistent results, you should 
be able to query each individual *core* that is a replica of the shard 
this document belongs in, using a "distrib=false" request param, and see 
that it exists on the "leader" replica, but not on one/some of the other 
replicas.
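
Something along these lines, one request per core (the core names here are 
hypothetical -- check the Cloud panel of the admin UI or /admin/cores for the 
real ones):

  $ curl 'http://host1:8983/solr/mycoll_shard1_replica_n1/select?q=id:YOUR_DOC_ID&fl=id,_version_&distrib=false'
  $ curl 'http://host2:8983/solr/mycoll_shard1_replica_n2/select?q=id:YOUR_DOC_ID&fl=id,_version_&distrib=false'

Comparing numFound (and the _version_ values) across the replicas of that 
shard will show which core is out of sync.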

Understanding why/how you got into this situation, though, would require 
understanding what exactly you mean by "a message that one of the mandatory fields 
is missing"

Can you show us some details?  solrconfig/schema, example documents, 
example updates, log messages from the various nodes when these updates 
"fail", etc... ?

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

: One more thing that might be important - we're using nested schema, and 
: we recently encountered several issues that make me think that this 
: combination - nested and atomic updates (of parent documents) - is the 
: root cause.

It's very possible that there are some bugs related to atomic updates and 
nested documents -- the code for dealing with that combination is 
relatively new, and making it work correctly requires special fields in 
the schema -- on top of the normal atomic update rules.  The documentation 
on this was heavily updated in the 8.7 ref-guide...

https://lucene.apache.org/solr/guide/8_7/indexing-nested-documents.html
https://lucene.apache.org/solr/guide/8_7/updating-parts-of-documents.html#updating-child-documents



-Hoss
http://www.lucidworks.com/


Re: Excessive logging 8.8.0

2021-02-04 Thread Chris Hostetter


FWIW: that log message was added to branch_8x by 3c02c9197376 as part of 
SOLR-15052 ... it's based on master commit 8505d4d416fd -- but that does 
not add that same logging message ... so it definitely smells like a 
mistake to me that 8x would add this INFO level log message that master 
doesn't have.

it's worth noting that 3c02c9197376 included many other "log.info(...)" 
messages that had 'nocommit' comments to change them to debug later ... 
making me more confident this is a mistake...

https://issues.apache.org/jira/browse/SOLR-15136
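
In the meantime, if the noise is a problem, one possible workaround (I haven't 
verified it against 8.8.0 specifically) is to bump just that logger to WARN at 
runtime via the Logging API -- note this doesn't survive a restart unless you 
also change log4j2.xml:

  $ curl 'http://localhost:8983/solr/admin/info/logging?set=org.apache.solr.common.cloud.ZkStateReader:WARN&wt=json'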


: Date: Thu, 4 Feb 2021 12:45:16 +0100
: From: Markus Jelsma 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Excessive logging 8.8.0
: 
: Hello all,
: 
: We upgraded some nodes to 8.8.0 and notice there is excessive logging on
: INFO when some traffic/indexing is going on:
: 
: 2021-02-04 11:42:48.535 INFO  (qtp261748192-268) [c:data s:shard2
: r:core_node4 x:data_shard2_replica_t2] o.a.s.c.c.ZkStateReader already
: watching , added to stateWatchers
: 
: Is this to be expected?
: 
: Thanks,
: Markus
: 

-Hoss
http://www.lucidworks.com/


Re: Solr 8.7.0 memory leak?

2021-01-29 Thread Chris Hostetter


: there are not many OOM stack details printed in the solr log file, it's
: just saying No enough memory, and it's killed by oom.sh(solr's script).

"Not many" isn't the same as "none" ... can you tell us *ANYTHING* about what 
the logs look like? ... as I said: it's not just the details of the OOM 
that would be helpful: any details about what the Solr logs say Solr is 
doing while the memory is growing (before the OOM) would also be helpful.

: My question(issue) is not it's OOM or not, the issue is why JVM memory
: usage keeps growing up but never going down, it's not how java programs
: work. the normal java process can use a lot of memory, but it will throw
: away after using it instead of keep it in the memory with reference.

You're absolutely right -- that's how a Java program should behave, and 
that's what I'm seeing when I try to reproduce what you're describing with 
Solr 8.7.0 by running a few nodes, creating a collection, and waiting.

In other words: I can't reproduce what you are seeing based on the 
information you've provided -- so the only thing I can do is ask you 
for more information: what you see in the logs, what your configs are, the 
exact steps you take to trigger this situation, etc...

Please help us help you so we can figure out what is causing the 
behavior you are seeing and try to fix it

: > Knowing exactly what your config looks like would help, knowing exactly
: > what you do before you see the OOM would help (are you realy just creating
: > the collections, or is it actauly neccessary to index some docs into those
: > collections before you see this problem start to happen? what do the logs
: > say during the time when the heap usage is just growing w/o explanation?
: > what is the stack trace of the OOM? what does a heap abalysis show in
: > terms of large/leaked objects? etc.
: >
: > You have to help us understand the minimally viable steps we need
: > to execute to see the behavior you see
: >
: > https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists


-Hoss
http://www.lucidworks.com/


Re: Solr 8.7.0 memory leak?

2021-01-28 Thread Chris Hostetter


: Is the matter to use the config file ? I am using custom config instead 
: of _default, my config is from solr 8.6.2 with custom solrconfig.xml

Well, it depends on what's *IN* the custom config ... maybe you are using 
some built in functionality that has a bug but didn't get triggered by my 
simple test case -- or maybe you have custom components that have memory 
leaks.

The point of the question was to try and understand where/how you are 
running into an OOM i can't reproduce.

Knowing exactly what your config looks like would help, and knowing exactly 
what you do before you see the OOM would help (are you really just creating 
the collections, or is it actually necessary to index some docs into those 
collections before you see this problem start to happen? what do the logs 
say during the time when the heap usage is just growing w/o explanation? 
what is the stack trace of the OOM? what does a heap analysis show in 
terms of large/leaked objects? etc.)
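
(For the heap analysis part, something like this against the running Solr 
process is usually enough to get started -- assuming the JDK tools are on your 
PATH and Solr was started with the stock scripts so 'start.jar' identifies the 
process:)

  $ SOLR_PID=$(pgrep -f start.jar)
  $ jcmd $SOLR_PID GC.class_histogram | head -30    # quick look at the classes using the most heap
  $ jmap -dump:live,format=b,file=/tmp/solr-heap.hprof $SOLR_PID    # full dump for MAT / VisualVM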

You have to help us understand the minimally viable steps we need 
to execute to see the behavior you see

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

-Hoss
http://www.lucidworks.com/



Re: Solr 8.7.0 memory leak?

2021-01-28 Thread Chris Hostetter


FWIW, I just tried using 8.7.0 to run:
bin/solr -m 200m -e cloud -noprompt

And then setup the following bash one liner to poll the heap metrics...

while : ; do
  date
  echo "node 8983" && (curl -sS http://localhost:8983/solr/admin/metrics | grep memory.heap)
  echo "node 7574" && (curl -sS http://localhost:7574/solr/admin/metrics | grep memory.heap)
  sleep 30
done

...what I saw was about what I expected ... heap usage slowly grew on both 
nodes as bits of garbage were generated (as expected considering the 
metrics requests, let alone typical background threads) until eventually it 
garbage collected back down to low usage w/o ever encountering an OOM or 
crash...


Thu Jan 28 12:38:47 MST 2021
node 8983
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.7613688659667969,
  "memory.heap.used":159670624,
node 7574
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.7713688659667969,
  "memory.heap.used":161767776,
Thu Jan 28 12:39:17 MST 2021
node 8983
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.7813688659667969,
  "memory.heap.used":163864928,
node 7574
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.7913688659667969,
  "memory.heap.used":165962080,
Thu Jan 28 12:39:47 MST 2021
node 8983
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.8063688659667969,
  "memory.heap.used":169107808,
node 7574
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.8113688659667969,
  "memory.heap.used":170156384,
Thu Jan 28 12:40:17 MST 2021
node 8983
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.3428504943847656,
  "memory.heap.used":71900960,
node 7574
  "memory.heap.committed":209715200,
  "memory.heap.init":209715200,
  "memory.heap.max":209715200,
  "memory.heap.usage":0.3528504943847656,
  "memory.heap.used":73998112,






-Hoss
http://www.lucidworks.com/


Re: Solr 8.7.0 memory leak?

2021-01-28 Thread Chris Hostetter


: Hi, I am using solr 8.7.0, centos 7, java 8.
: 
: I just created a few collections and no data, memory keeps growing but 
: never go down, until I got OOM and solr is killed

Are you using a custom config set, or just the _default configs?

If you start up this single node with something like -Xmx5g and create 
5 collections and do nothing else, how long does it take you to see the 
OOM?



-Hoss
http://www.lucidworks.com/


Re: Is there way to autowarm new searcher using recently ran queries

2021-01-28 Thread Chris Hostetter


: I am wondering if there is a way to warmup new searcher on commit by
: rerunning queries processed by the last searcher. May be it happens by
: default but then I can't understand why we see high query times if those
: searchers are being warmed.

it only happens by default if you have an 'autowarmCount' enabled for each 
cache...

https://lucene.apache.org/solr/guide/8_7/query-settings-in-solrconfig.html#caches

But note that this warms the caches *individually* -- it doesn't re-simulate 
a "full request", so some things (like stored fields) may still be "cold" 
on disk.

This typically isn't a problem -- except for people relying on FieldCache 
-- which is a query-time "un-inversion" of fields for sorting/faceting -- 
and has no explicit Solr configuration or warming.

For that you have to use something like Joel described -- static 
'newSearcher' QuerySenderListener queries that will sort/facet on those 
fields

https://lucene.apache.org/solr/guide/8_7/query-settings-in-solrconfig.html#query-related-listeners

...but a better solution is to make sure you use DocValues on these fields 
instead.
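
For example, adding a new docValues-enabled field via the Schema API would 
look something like this (the field/collection names are made up; note that 
turning docValues on for an *existing* field requires a full reindex):

  $ curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/mycollection/schema -d '{
    "add-field": {"name":"category_dv", "type":"string",
                  "docValues":true, "indexed":true, "stored":true}
  }'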




-Hoss
http://www.lucidworks.com/


Re: maxBooleanClauses change in solr.xml not reflecting in solr 8.4.1

2021-01-06 Thread Chris Hostetter


: You need to update EVERY solrconfig.xml that the JVM is loading for this to
: actually work.

that has not been true for a while, see SOLR-13336 / SOLR-10921 ...

: > 2. updated solr.xml :
: > <maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>
: 
: I don't think it's currently possible to set the value with solr.xml.

Not only is it possible, it's necessary -- the value in solr.xml acts as 
a hard upper limit (and affects all queries, even internally expanded 
queries) on the "soft limit" in solrconfig.xml (which only affects 
explicitly supplied boolean queries from users).

As to the original question...

> 2021-01-05 14:03:59.603 WARN  (qtp1545077099-27) x:col1_shard1_replica_n3
> o.a.s.c.SolrConfig solrconfig.xml: <maxBooleanClauses> of 2048 is greater
> than global limit of 1024 and will have no effect

I attempted to reproduce this with 8.4.1 and did not see the problem you 
are describing.

Are you 100% certain you are updating the correct solr.xml file?  If you 
add some non-XML gibberish to the solr.xml you are editing, does the Solr 
node fail to start up?

Remember that when using SolrCloud, Solr will try to load solr.xml from ZK 
first, and only look on local disk if it can't be found in ZK ... look for 
log messages like "solr.xml found in ZooKeeper. Loading..." vs "Loading 
solr.xml from SolrHome (not found in ZooKeeper)"
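
A couple of quick sanity checks (adjust the ZK connect string to match your 
cluster):

  $ bin/solr zk ls / -z localhost:2181                          # is there a /solr.xml node in ZK?
  $ bin/solr zk cp zk:/solr.xml /tmp/solr.xml -z localhost:2181 # pull it down and inspect it

And rather than editing the literal number, you can leave the 
${solr.max.booleanClauses:...} syntax in place and set the system property, 
e.g. in solr.in.sh:

  SOLR_OPTS="$SOLR_OPTS -Dsolr.max.booleanClauses=2048"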




-Hoss
http://www.lucidworks.com/


Re: how to check num found

2021-01-04 Thread Chris Hostetter


Can't you just configure Nagios to do a "negative match" against 
numFound=0 ? ... i.e.: "if the response matches 'numFound=0', fail the check."

(IIRC there's an '--invert-regex' option for this)
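
Something like this with the standard monitoring-plugins check_http (untested; 
exact flag names can vary between plugin versions):

  $ check_http -H solr.example.com -p 8983 \
      -u '/solr/mycollection/select?q=FOO:BAR&rows=0' \
      -r '"numFound":0' --invert-regex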

: Date: Mon, 28 Dec 2020 14:36:30 -0600
: From: Dmitri Maziuk 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: how to check num found
: 
: Hi all,
: 
: we're doing periodic database reloads from external sources and I'm trying to
: figure out how to monitor for errors. E.g. I'd run a query '?q=FOO:BAR&rows=0'
: and check if "numFound" > 0, that'd tell me if the reload succeeded.
: 
: The check is done using nagios curl plugin, and while it can match a string in
: the response, the "> 0" check would require writing an extra parser -- it's a
: simple enough two-liner, but I'd rather not add more moving pieces if I can
: help it.
: 
: The best I can figure so far is
: ```
: fl=result:if(gt(docfreq(FOO,BAR),0),"YES","NO")&rows=1
: ```
: -- returns '"result":"NO"' that our nagios plugin can look for.
: 
: Is there a better/simpler way?
: 
: TIA
: Dima
: 

-Hoss
http://www.lucidworks.com/


Re: Authentication for each collection

2020-10-01 Thread Chris Hostetter


https://lucene.apache.org/solr/guide/8_6/authentication-and-authorization-plugins.html

*Authentication* is global, but *Authorization* can be configured to use 
rules that restrict permissions on a per collection basis...

https://lucene.apache.org/solr/guide/8_6/rule-based-authorization-plugin.html#permissions-2

In concrete terms, the specific example you asked about is supported:

: Example ; user1:password1 for collection A
:  user2:password2 for collection B

What would *NOT* be supported is having a distinct set of users for each 
collection, such that there could be two different "user1" instances, each 
with its own password, where each "user1" had access to only one collection.
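
A rough sketch of what that looks like with the Rule-Based Authorization API 
(the role names are made up, and you'd repeat the set-permission call for 
collection B):

  $ curl -u admin:ADMINPASSWORD -H 'Content-type:application/json' \
      http://localhost:8983/solr/admin/authorization -d '{
    "set-user-role": {"user1": ["collA-role"], "user2": ["collB-role"]},
    "set-permission": {"collection": "A", "path": "/select", "role": "collA-role"}
  }'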



: Date: Thu, 1 Oct 2020 13:45:14 -0700
: From: sambasivarao giddaluri 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Authentication for each collection
: 
: Hi All,
: We have 2 collections and we are using  basic authentication against solr ,
: configured in security.json . Is it possible to configure in such a way
: that we have different credentials for each collection . Please advise if
: there is any other approach i can look into.
: 
: Example ; user1:password1 for collection A
:  user2:password2 for collection B
: 

-Hoss
http://www.lucidworks.com/


Re: Semantic Knowledge Graph Jar File

2020-09-04 Thread Chris Hostetter


: I need to integrate Semantic Knowledge Graph with Solr 7.7.0 instance.

If you're talking about the work Trey Grainger has written some papers on, 
which was originally implemented in this repo...

https://github.com/careerbuilder/semantic-knowledge-graph

...then that work was incorporated into Solr as the 'relatedness()' 
aggregation in JSON faceting, and has been included in Solr since 7.4...

https://issues.apache.org/jira/browse/SOLR-9480

https://lucene.apache.org/solr/guide/7_7/json-facet-api.html#semantic-knowledge-graphs
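
A minimal sketch of what that looks like (the field names are made up; 'fore' 
and 'back' are just request params that the aggregation references):

  $ curl http://localhost:8983/solr/mycollection/select -d 'q=*:*' -d 'rows=0' \
      -d 'fore=skills_s:java' -d 'back=*:*' \
      --data-urlencode 'json.facet={"related_skills": {"type":"terms", "field":"skills_s",
            "limit":10, "sort":"r desc", "facet": {"r": "relatedness($fore,$back)"}}}'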


-Hoss
http://www.lucidworks.com/


Re: Solrj client 8.6.0 issue special characters in query

2020-08-07 Thread Chris Hostetter

: Hmm, setting -Dfile.encoding=UTF-8 solves the problem. I have to now check
: which component of the application screws it up, but at the moment I do NOT
: believe it is related to Solrj.

You can use the "forbidden-apis" project to analyze your code and look for 
uses of APIs that depend on the default file encoding, locale, charset, 
etc...

https://github.com/policeman-tools/forbidden-apis

...this project started as an offshoot of build rules in 
Lucene/Solr, precisely to help detect problems like the one you 
are facing -- and it's used to analyze all Solr code, which is why I'm 
pretty confident that no SolrJ code is mistakenly 
parsing/converting/encoding your input -- although in theory it could be 
a 3rd party library Solr uses.  (Hardcoding the Unicode string in your 
Java application and passing it as a Solr param should help prove/disprove 
that.)
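
(A quick way to see which default encoding the JVM actually picked up -- i.e. 
what -Dfile.encoding is overriding:)

  $ java -XshowSettings:properties -version 2>&1 | grep -i encoding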

: 
: On Fri, Aug 7, 2020 at 11:53 AM Jörn Franke  wrote:
: 
: > Dear all,
: >
: > I have the following issues. I have a Solrj Client 8.6 (but it happens
: > also in previous versions), where I execute, for example, the following
: > query:
: > Jörn
: >
: > If I look into Solr Admin UI it finds all the right results.
: >
: > If I use Solrj client then it does not find anything.
: > Further, investigating in debug mode it seems that the URI to server gets
: > wrongly encoded.
: > Jörn becomes J%C3%83%C2%B6rn
: > It should become only J%C3%B6rn
: > any idea why this happens and why it add %83%C2 inbetween? Those do not
: > seem to be even valid UTF-8 characters
: >
: > I verified with various statements that I give to Solrj the correct
: > encoded String "Jörn"
: >
: > Can anyone help me here?
: >
: > Thank you.
: >
: > best regards
: >
: 

-Hoss
http://www.lucidworks.com/

Re: Why External File Field is marked as indexed in solr admin SCHEMA page?

2020-07-22 Thread Chris Hostetter
: **
: 
: **
...
: I was expecting that for field "fieldA" indexed will be marked as false and
: it will not be part of the index. But Solr admin "SCHEMA page" (we get this
: option after selecting collection name in the drop-down menu)  is showing
: it as an indexed field (green tick mark under Indexed flag).

Because, per the docs, the IndexSchema uses a default assumption of "true" 
for the "indexed" property (if not specified at a field/fieldtype level) 
...

https://lucene.apache.org/solr/guide/8_4/field-type-definitions-and-properties.html#field-default-properties

Property: indexed
Description: If true, the value of the field can be used in queries to retrieve 
matching documents.
Values: true or false   
Implicit Default: true

...ExternalFileField is "special" and, as noted in its docs, it is not 
searchable -- it doesn't actually care what the indexed (or "stored") 
properties are ... but the default values of those properties as assigned 
by the schema defaults are still there in the metadata of the field -- 
which is what the schema API/browser are showing you.


Imagine you had a field that was a TextField -- implicitly 
indexed="true" -- but it was impossible for you to ever put any values 
in that field (say, for the sake of argument, you used an analyzer that 
threw away all terms).  The schema browser would say: "It's (implicitly) 
marked indexed=true, therefore it's searchable" even though searching on that 
field would never return anything ... an equivalent situation to 
ExternalFileField.

(ExternalFileField could be modified to override the implicit default for 
these properties, but that's not something anyone has ever really worried 
about because it wouldn't functionally change any of its behavior.)


-Hoss
http://www.lucidworks.com/


Re: JSON Facet with local parameter

2020-07-13 Thread Chris Hostetter

The JSON-based query APIs (including JSON Faceting) use a (unfortunately 
subtly different) '${NAME}' syntax for dereferencing variables in the 
"body" of a JSON data structure...

https://lucene.apache.org/solr/guide/8_5/json-request-api.html#parameter-substitution-macro-expansion

...but note that you may need to put "quotes" around the variable 
de-reference in order to make it a valid JSON string.


: Date: Mon, 13 Jul 2020 04:03:50 +
: From: Mohamed Sirajudeen Mayitti Ahamed Pillai
: 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: JSON Facet with local parameter
: 
: Is it possible to refer local parameter for Range JSON Facet’s star/end/gap 
inputs ?
: 
: 
: I am trying something like below, but it is now working.
: 
http://server:8983/solr/kfl/select?arrivalRange=NOW/DAY-10DAYS&json.facet={"NEW 
ARRIVALS":{"start":$arrivalRange, 
"sort":"index","type":"range","field":"pdp_activation_date_dt","gap":"+10DAYS","mincount":1,"limit":-1,"end":"NOW/DAY"}}&q=*:*&rows=0
: 
: Getting below error,
: 
: "error": {"metadata": 
["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":
 "Can't parse value $arrivalRange for field: pdp_activation_date_dt","code": 
400}
: 
: 
: How to instruct Solr JSON Facet to reference another parameter that is added 
to the search request ?
: 
: 
: 
: 
: 

-Hoss
http://www.lucidworks.com/

Re: Does 8.5.2 depend on 8.2.0

2020-06-18 Thread Chris Hostetter


: Subject: Does 8.5.2 depend on 8.2.0

No.  The code certainly doesn't, but I suppose it's possible some metadata 
somewhere in some pom file may be broken? 


: My build.gradle has this:
: compile(group: 'org.apache.solr', name: 'solr-solrj', version:'8.5.2')
: No where is there a reference to 8.2.0

It sounds like you are using transitive dependencies (otherwise it 
wouldn't make sense for you to wonder if 8.5.2 depends on 8.2.0) ... is it 
possible some *other* library you are depending on is depending on 8.2.0 
directly? What does your dependency tree look like?

https://docs.gradle.org/current/userguide/viewing_debugging_dependencies.html
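
For example (the task and configuration names may differ depending on how your 
build is set up):

  $ ./gradlew -q dependencies --configuration runtimeClasspath | grep -B 5 '8\.2\.0'
  $ ./gradlew -q dependencyInsight --dependency solr-solrj --configuration runtimeClasspath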


-Hoss
http://www.lucidworks.com/


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Chris Hostetter


First off: Forgive me if my comments/questions are redundant or uninformed 
based on the larger discussion taking place.  I have not 
caught up on the whole thread before replying -- but that's solely based 
on a lack of time on my part, not a lack of willingness to embrace this 
change.


From skimming a handful of messages in this thread, there's one aspect of 
the discussion -- particularly in the context of "should we or should we 
not re-use / overlap with 'SolrCloud' terminology" -- that seems (from 
limited review) to be getting overlooked...


Even today, before you start to consider if/what to replace the M & S 
terms with, we've already reached the point where the 
replication/IndexFetcher code can use those terms in confusing/misleading 
ways -- particularly in the context of SolrCloud, recovery, backups, 
etc...

I think a meaningful conversation about retiring the existing terminology 
should probably take into account at least 3 distinct questions:

#1 - what terminology should be used/expected/documented in non-SolrCloud 
usage of explicitly configuring "classic" ReplicationHandler in 
solrconfig.xml ?

#2 - what terminology should be used in the source code of
ReplicationHandler / IndexFetcher related code?

#3 - what terminology should be used in the *log* messages of 
ReplicationHandler / IndexFetcher related code? and (# 3a) should that 
terminology vary based on the type of cluster and when/why/how the 
replication code is being used?


In reverse order: My personal answers would be:

#3a - yes

#3 - I think we can & should tweak the replication code to understand 
the context it's being used in and adjust its log/error messages accordingly

#2 - w/o digging into the code in depth here, I suspect something like 
"remoteSource + localDest" would probably work well in the context of 
"index fetching" -- or even just simply "fetchSource", w/o any need for a 
'slave' term equivalent because the context is usually just "local".  ... 
I don't think there are any contexts where it really matters, but 
"replicationSource + replicationDest" could be more generic terms for 
situations where the code isn't specifically in an "I am a core that's doing 
fetching" context (alternatively: "remoteSource + localDest" could have 
parity with "localSource + remoteDest" if there are any situations I'm not 
thinking of where we "push" an index)

#1 - this is the least interesting question to me personally (given that 
we are moving away from it in general in favor of SolrCloud), but I think 
in the context of how the terms are used in solrconfig.xml even the 
original M & S terms have never really made much sense if you look at 
where/how they are used in the configuration -- particularly in the 
context of the "repeater" use case.  I would suggest as a straw man 
replacing the 'name="master"' config block with a 
'name="provideSnapshots"' block using hte same options; and replace the 
'name="slave"' config block with 'name="fetchSnapshots"' config block, 
using mostly the same options, except replacing 'masterUrl' with 
'sourceUrl'.



-Hoss
http://www.lucidworks.com/


Re: TimestampUpdateProcessorFactory updates the field even if the value if present

2020-05-26 Thread Chris Hostetter
: Subject: TimestampUpdateProcessorFactory updates the field even if the value
: if present
: 
: Hi,
: 
: Following is the update request processor chain.
: 
:  <processor class="solr.TimestampUpdateProcessorFactory">
:    <str name="fieldName">index_time_stamp_create</str>
:  </processor>
: 
: And, here is how the field is defined in schema.xml
: 
: 
: 
: Every time I index the same document, above field changes its value with
: latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
: page, if a document does not contain a value in the timestamp field, a new

Based on the wording of your question, I suspect you are confused about 
the overall behavior of how "updating" an existing document works in Solr, 
and how update processors "see" an *input document* when processing an 
add/update command.


First off, completely ignoring TimestampUpdateProcessorFactory and 
assuming just the simplest possible update, let's clarify how 
"updates" work.  Let's assume that when you say you "index the same 
document" twice, you do so with a few different field values ...

First Time...

{  id:"x",  title:"" }

Second time...

{  id:"x",  body:"      xxx" }

Solr does not implicitly know that you are trying to *update* that 
document: the final result will not be a document containing both a 
"title" field and a "body" field in addition to the "id"; it will *only* 
have the "id" and "body" fields, and the title field will be lost.

The way to "update" a document *and keep existing field values* is with 
one of the "Atomic Update" command options...

https://lucene.apache.org/solr/guide/8_4/updating-parts-of-documents.html#UpdatingPartsofDocuments-AtomicUpdates

{  id:"x",  title:"" }

Second time...

{  id:"x",  body: { set: "      xxx" } }


Now, with that background info clarified: let's talk about update 
processors


The docs for TimestampUpdateProcessorFactory are referring to how it 
modifies an *input* document that it receives (as part of the processor 
chain). It adds the timestamp field if it's not already in the *input* 
document; it doesn't know anything about whether that document is already 
in the index, or if it has a value for that field in the index.


When processors like TimestampUpdateProcessorFactory (or any other 
processor that modifies an *input* document) are run, they don't know if the 
document you are "indexing" already exists in the index or not.  Even if 
you are using the "atomic update" options to set/remove/add a field value, 
with the intent of preserving all other field values, the documents passed 
down the processor chain don't include those values until the "document 
merger" logic is run -- as part of the DistributedUpdateProcessor (which, 
if not explicit in your chain, happens immediately before the 
RunUpdateProcessorFactory).

Off the top of my head I don't know if there is an "easy" way to have a 
timestamp added to "new" documents, but left "as is" for existing 
documents.

Untested idea...

Use an explicitly configured 
DistributedUpdateProcessorFactory, so that (in addition to putting 
TimestampUpdateProcessorFactory before it) you can 
also put MinFieldValueUpdateProcessorFactory on the timestamp field 
*after* DistributedUpdateProcessorFactory (but before 
RunUpdateProcessorFactory).  

I think that would work?

Just putting TimestampUpdateProcessorFactory after the 
DistributedUpdateProcessorFactory would be dangerous, because it would 
introduce discrepancies -- each replica would wind up with its own 
locally computed timestamp.  Having the timestamp generated before the 
distributed update processor ensures the value is computed only once.

-Hoss
http://www.lucidworks.com/


Re: stored=true what should I see from stem fields

2020-04-24 Thread Chris Hostetter


: Is what is shown in "analysis" the same as what is stored in a field?

https://lucene.apache.org/solr/guide/8_5/analyzers.html

The output of an Analyzer affects the terms indexed in a given field (and 
the terms used when parsing queries against those fields) but it has no 
impact on the stored value for the fields. For example: an analyzer might 
split "Brown Cow" into two indexed terms "brown" and "cow", but the stored 
value will still be a single String: "Brown Cow"


: So I indexed a document with "the quick brown fox jumped over the
: sleeping dog" set for stuff_raw and when I query for the document
: stuff_stems just has "the quick brown fox jumped over the sleeping
: dog" and NOT "quick brown fox jump over sleep dog"


https://lucene.apache.org/solr/guide/8_5/copying-fields.html

Fields are copied before analysis is done, meaning you can have two 
fields with identical original content, but which use different analysis 
chains and are stored in the index differently.



: Also stuff_everything only contains a single item, which is weird
: because I copy two things into it.

https://lucene.apache.org/solr/guide/8_5/copying-fields.html

Copying is done at the stream source level and no copy feeds into another 
copy. This means that copy fields cannot be chained i.e., you cannot copy 
from here to there and then from there to elsewhere. However, the same 
source field can be copied to multiple destination fields:


-Hoss
http://www.lucidworks.com/


Re: Solr filter cache hits not reflecting

2020-04-20 Thread Chris Hostetter
: 4) A query with different fq.
: 
http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]&fq=manu:samsung
...
: 5) A query with the same fq again (fq=manu:samsung OR manu:apple)the
: numbers don't get update for this fq hereafter for subsequent searches
: 
: 
http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]&fq=manu:samsung%20OR%20manu:apple

that's not just *A* query with the same fq, it's the *exact* same request 
(q + sort + pagination + all filters)

Which means that everything Solr needs to reply to this request is 
available in the *queryResultCache* -- no filterCache needed at all (if 
you had faceting enabled that would be a different issue: then the 
filterCache would still be needed in order to compute facet counts over 
the entire DocSet matching the query, not just the current page window)...


$ bin/solr -e techproducts
...

# mostly empty caches (techproducts has a single static warming query)

$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true'
 | grep -E 
'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
  "CACHE.searcher.queryResultCache.lookups":0,
  "CACHE.searcher.queryResultCache.inserts":1,
  "CACHE.searcher.queryResultCache.hits":0}},
  "CACHE.searcher.filterCache.hits":0,
  "CACHE.searcher.filterCache.lookups":0,
  "CACHE.searcher.filterCache.inserts":0,

# new q and fq: lookup & insert into both caches...

$ curl -sS 
'http://localhost:8983/solr/techproducts/select?q=popularity:[5%20TO%2012]&fq=manu:samsung%20OR%20manu:apple'
 > /dev/null
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true'
 | grep -E 
'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
  "CACHE.searcher.queryResultCache.lookups":1,
  "CACHE.searcher.queryResultCache.inserts":2,
  "CACHE.searcher.queryResultCache.hits":0}},
  "CACHE.searcher.filterCache.hits":0,
  "CACHE.searcher.filterCache.lookups":1,
  "CACHE.searcher.filterCache.inserts":1,

# new q, same fq: 
# lookup on both caches, hit on filter, insert on queryResultCache

$ curl -sS 
'http://localhost:8983/solr/techproducts/select?q=*:*&fq=manu:samsung%20OR%20manu:apple'
 > /dev/null
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true'
 | grep -E 
'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
  "CACHE.searcher.queryResultCache.lookups":2,
  "CACHE.searcher.queryResultCache.inserts":3,
  "CACHE.searcher.queryResultCache.hits":0}},
  "CACHE.searcher.filterCache.hits":1,
  "CACHE.searcher.filterCache.lookups":2,
  "CACHE.searcher.filterCache.inserts":1,

# same q & fq as before:
# hit on queryresultCache means no filterCache needed...

$ curl -sS 
'http://localhost:8983/solr/techproducts/select?q=popularity:[5%20TO%2012]&fq=manu:samsung%20OR%20manu:apple'
 > /dev/null
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true'
 | grep -E 
'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
  "CACHE.searcher.queryResultCache.lookups":3,
  "CACHE.searcher.queryResultCache.inserts":3,
  "CACHE.searcher.queryResultCache.hits":1}},
  "CACHE.searcher.filterCache.hits":1,
  "CACHE.searcher.filterCache.lookups":2,
  "CACHE.searcher.filterCache.inserts":1,



-Hoss
http://www.lucidworks.com/


Re: Solr facet order same as result set

2020-04-20 Thread Chris Hostetter


The goal you are describing doesn't really sound at all like faceting -- 
it sounds like what you want might be "grouping" (or collapse/expand) 
... OR: depending on how you index your data perhaps what you really 
want is "nested documents" ... or maybe maybe if youre usecase is simple 
enough just using the "subquery" DocTransformer w/o needing explicit 
relationships between the docs at indexing time.

I would suggest you read the docs on each of these features and see what 
sounds best to you...

https://lucene.apache.org/solr/guide/8_5/result-grouping.html
https://lucene.apache.org/solr/guide/8_5/collapse-and-expand-results.html

https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html
https://lucene.apache.org/solr/guide/8_5/searching-nested-documents.html
https://lucene.apache.org/solr/guide/8_5/transforming-result-documents.html#child-childdoctransformerfactory

https://lucene.apache.org/solr/guide/8_5/transforming-result-documents.html#subquery
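
If the [subquery] transformer ends up being the best fit for the fc_id 
question below, a rough sketch (field names taken from your example, 
everything else hypothetical):

  $ curl http://localhost:8983/solr/mycollection/select -d 'q=YOUR_QUERY' \
      -d 'fl=id,sku,group_id,fcs:[subquery]' \
      --data-urlencode 'fcs.q={!terms f=sku v=$row.sku}' \
      -d 'fcs.fl=fc_id' -d 'fcs.rows=100'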


: Date: Mon, 20 Apr 2020 04:37:06 -0700 (MST)
: From: Venu 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Solr facet order same as result set
: 
: Probably I haven't framed my question properly.
: 
: Consider the schema with the fields - id, sku, fc_id, group_id
: The same SKU can be part of multiple documents with different fc_id and
: group_id.
: 
: For a given search query, multiple documents having the same SKU will be
: returned. Is there any way I can get all the fc_ids for those SKUs returned
: in the result set? Do I have to do a separate query with those SKUs again to
: fetch the fc_ids through json facets?
: 
: I am fetching the fc_ids through JSON-facets. But the order of those
: returned from facets is different from the result set. 
: 
: 
: 
: --
: Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
: 

-Hoss
http://www.lucidworks.com/


Re: Solr filter cache hits not reflecting

2020-04-20 Thread Chris Hostetter


: I was trying to analyze the filter cache performance and noticed a strange
: thing. Upon searching with fq, the entry gets added to the cache the first
: time. Observing from the "Stats/Plugins" tab on Solr admin UI, the 'lookup'
: and 'inserts' count gets incremented.
: However, if I search with the same fq again, I expect the lookup and hits
: count to increase, but it doesn't. This ultimately results in an incorrect
: hitratio.

We'll need to see the actual specifics of the requests you're executing & 
stats you're seeing in order to make any guesses as to why you're not 
seeing the expected outcome.

Wild guesses: 
- Are you using date-math-based fq params that don't round?  
- Are you using SolrCloud and some of your requests are getting routed to 
different replicas?
- Are you using some complex/custom filter impl that may have a bug in 
its equals/hashCode impl that prevents it from being a cache hit?


Here's an example showing that the basics of filterCache work fine with 
8.5 for trivial examples...

$ bin/solr -e techproducts
...
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true&key=filterCache'
 | grep 'CACHE.searcher.filterCache'
  "CACHE.searcher.filterCache.hits":0,
  "CACHE.searcher.filterCache.cumulative_evictions":0,
  "CACHE.searcher.filterCache.cleanupThread":false,
  "CACHE.searcher.filterCache.size":0,
  "CACHE.searcher.filterCache.maxRamMB":-1,
  "CACHE.searcher.filterCache.hitratio":0.0,
  "CACHE.searcher.filterCache.warmupTime":0,
  "CACHE.searcher.filterCache.idleEvictions":0,
  "CACHE.searcher.filterCache.evictions":0,
  "CACHE.searcher.filterCache.cumulative_hitratio":0.0,
  "CACHE.searcher.filterCache.lookups":0,
  "CACHE.searcher.filterCache.cumulative_hits":0,
  "CACHE.searcher.filterCache.cumulative_inserts":0,
  "CACHE.searcher.filterCache.ramBytesUsed":1328,
  "CACHE.searcher.filterCache.cumulative_idleEvictions":0,
  "CACHE.searcher.filterCache.inserts":0,
  "CACHE.searcher.filterCache.cumulative_lookups":0}}},
$ curl -sS 
'http://localhost:8983/solr/techproducts/query?q=*:*&fq=inStock:true' > 
/dev/null
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true&key=filterCache'
 | grep 'CACHE.searcher.filterCache'
  "CACHE.searcher.filterCache.hits":0,
  "CACHE.searcher.filterCache.cumulative_evictions":0,
  "CACHE.searcher.filterCache.cleanupThread":false,
  "CACHE.searcher.filterCache.size":1,
  "CACHE.searcher.filterCache.maxRamMB":-1,
  "CACHE.searcher.filterCache.hitratio":0.0,
  "CACHE.searcher.filterCache.warmupTime":0,
  "CACHE.searcher.filterCache.idleEvictions":0,
  "CACHE.searcher.filterCache.evictions":0,
  "CACHE.searcher.filterCache.cumulative_hitratio":0.0,
  "CACHE.searcher.filterCache.lookups":1,
  "CACHE.searcher.filterCache.cumulative_hits":0,
  "CACHE.searcher.filterCache.cumulative_inserts":1,
  "CACHE.searcher.filterCache.ramBytesUsed":4808,
  "CACHE.searcher.filterCache.cumulative_idleEvictions":0,
  "CACHE.searcher.filterCache.inserts":1,
  "CACHE.searcher.filterCache.cumulative_lookups":1}}},
$ curl -sS 
'http://localhost:8983/solr/techproducts/query?q=name:solr&fq=inStock:true' > 
/dev/null
$ curl -sS 
'http://localhost:8983/solr/techproducts/admin/mbeans?wt=json&stats=true&cat=CACHE&indent=true&key=filterCache'
 | grep 'CACHE.searcher.filterCache'
  "CACHE.searcher.filterCache.hits":1,
  "CACHE.searcher.filterCache.cumulative_evictions":0,
  "CACHE.searcher.filterCache.cleanupThread":false,
  "CACHE.searcher.filterCache.size":1,
  "CACHE.searcher.filterCache.maxRamMB":-1,
  "CACHE.searcher.filterCache.hitratio":0.5,
  "CACHE.searcher.filterCache.warmupTime":0,
  "CACHE.searcher.filterCache.idleEvictions":0,
  "CACHE.searcher.filterCache.evictions":0,
  "CACHE.searcher.filterCache.cumulative_hitratio":0.5,
  "CACHE.searcher.filterCache.lookups":2,
  "CACHE.searcher.filterCache.cumulative_hits":1,
  "CACHE.searcher.filterCache.cumulative_inserts":1,
  "CACHE.searcher.filterCache.ramBytesUsed":4808,
  "CACHE.searcher.filterCache.cumulative_idleEvictions":0,
  "CACHE.searcher.filterCache.inserts":1,
  "CACHE.searcher.filterCache.cumulative_lookups":2}}},

...so the first time we use 'fq=inStock:true' we get a single lookup and a 
single insert.  The second time we use it (even with a different 'q' param) 
we get our 2nd lookup and our 1st hit -- no new inserts -- and now we have 
a 50% hitratio.

how does that compare with what you see?  what do similar commands show 
you with your fq?




-Hoss
http://www.lucidworks.com/


Re: Required operator (+) is being ignored when using default conjunction operator AND

2020-04-13 Thread Chris Hostetter
On Sat, 11 Apr 2020, Eran Buchnick wrote:

: Date: Sat, 11 Apr 2020 23:34:37 +0300
: From: Eran Buchnick 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Required operator (+) is being ignored when using default
: conjunction operator AND
: 
: Hoss, thanks a lot for the informative response. I understood my
: misunderstanding with infix and prefix operators. Need to rethink about the
: term occurrence support in my search service.

Sure,

And I appreciate you asking about this -- it spurred me to file this...

https://issues.apache.org/jira/browse/LUCENE-9315


: On Mon, Apr 6, 2020, 20:43 Chris Hostetter  wrote:
: 
: >
: > : I red your attached blog post (and more) but still the penny hasn't
: > dropped
: > : yet about what causes the operator clash when the default operator is
: > AND.
: > : I red that when q.op=AND, OR will change the left(if not MUST_NOT) and
: > : right clause Occurs to SHOULD - what that means is that the "order of
: > : operations" in this case is giving the infix operator the mandate to
: > : control the prefix operator?
: >
: > Not quite anything that complex... sorry, but the blog post was focused
: > on
: > describe *what* happens when parsing, do explain why mixng prefix/infix is
: > bad ... i avoided getting bogged down into *why* it happens exactly the
: > way it does.
: >
: >
: > To get to the "why" you have to circle back to the higher level concept
: > that the "prefix" operators very closely align to the underlying concepts
: > of the BooleanQuery/BooleanClause data structures: that each clause has an
: > "Occur" property which is either: MUST/SHOULD/MUST_NOT (or FILTER, but
: > setting asside scoring that's functionally equivilent to MUST).
: >
: > The 'infix' operators just manipulate the Occur property of the clauses on
: > either side of them.
: >
: > 'q.op=AND' and 'q.op=OR' are functionally really about setting the
: > "Default Occur Value For All Clauses That Do Not Have An Explicit Occur
: > Value" (ie: q.op=Occur.MUST and q.op=Occur.SHOULD) ... where the explicit
: > Occur value for each clause would be specified by it's prefix (+=MUST,
: > -=MUST_NOT ... there is no supported prefix for SHOULD, which is why
: > q.op=SHOULD is the defualt nad chaning it complicates the parser logic)
: >
: > In essence: After the q.op/default.occur is applied to all clauses (that
: > don't already have a prefix), then there is a left to right parsing that
: > let's the infix operators modify the "Occur" values of the clauses on
: > either side of them -- if those Occur values match the "default" for this
: > parser.
: >
: > So let's imagine 2 requests...
: >
: > 1)  {!q.op=AND}a +b OR c +d AND e
: > 2)  {!q.op=OR} x +y OR z +r AND s
: >
: > Here's what those wind up looking like internally with the default
: > applied...
: >
: > 1) q.op=MUST:MUST(a)   MUST(b) OR MUST(c)   MUST(d) AND MUST(e)
: > 2) q.op=SHOULD:  SHOULD(x) MUST(y) OR SHOULD(z) MUST(r) AND SHOULD(s)
: >
: > And here's how the infix operators change things as it parses left to
: > right building up the clauses...
: >
: > 1) q.op=MUST:MUST(a)   SHOULD(b) SHOULD(c) MUST(d)  MUST(e)
: > 2) q.op=SHOULD:  SHOULD(x) MUST(y)   SHOULD(z) MUST(r)  MUST(s)
: >
: > It's not actually done in "two passes" -- it's just that as the parsing
: > is done left to right, the default Occur is used unless/until set by a
: > prefix operators, and infix operators not only set the occur value
: > for the "next" clause, but also reach back to override the prior
: > Occur value if it matches the Default: because there is no "history" kept
: > to indicate that it was explicitly set, or how.  the left to right parsing
: > just does the best it can with the context it's got.
: >
: > :  A little background - I am trying to implement a google search like
: > : service and want to have the ability to have required and prohibit
: > : operators while still allowing default intersection operation as default
: > : operator. How can I achieve this with this limitation?
: >
: > If you want "intersection" to be the defualt, i'm not sure why you care
: > about having a "required" operator? (you didn't mention anything about an
: > "optional" operator even though your original example explicitly used
: > "OR" ... so not really sure if that was just a contrived example or if you
: > actaully care about supporting it?
: >
: > If you're not hung up on using a specific syntax, you might want to
: > consider the "simple" QParser -- it unfortunately re-uses the 'q.op=AND'
: > param syntax to indicate what the default Occur s

Re: Use boolean operator "-", the result is incorrect

2020-04-08 Thread Chris Hostetter
: Solr/Lucene do not employ boolean logic. See Hossman’s excellent post:
: 
: https://lucidworks.com/post/why-not-and-or-and-not/
: 
: Until you internalize this rather subtle difference, you’ll be surprised. A 
lot ;).
: 
: You can make query parsing look a lot like boolean logic by carefully using 
parentheses…

Yup.  And to circle back to the original request...

: >>> id, name_s, age_i
: >>> 1, a, 10
: >>> 2, b, 10
: >>> Use the following query syntax:
: >>> -name_s:a OR age_i:10

That says "Find all docs where age==10, then exclude docs where name==a".

If what you want is "all docs where name!=a, combined with all docs where 
age==10" that would be...

(*:* -name_s:a) age_i:10
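
(If you're testing from the command line, letting curl do the URL encoding 
keeps the spaces and special characters intact -- collection name here is 
just an example:)

  $ curl http://localhost:8983/solr/mycollection/select --data-urlencode 'q=(*:* -name_s:a) age_i:10'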


-Hoss
http://www.lucidworks.com/

Re: Required operator (+) is being ignored when using default conjunction operator AND

2020-04-06 Thread Chris Hostetter


: I red your attached blog post (and more) but still the penny hasn't dropped
: yet about what causes the operator clash when the default operator is AND.
: I red that when q.op=AND, OR will change the left(if not MUST_NOT) and
: right clause Occurs to SHOULD - what that means is that the "order of
: operations" in this case is giving the infix operator the mandate to
: control the prefix operator?

Not quite anything that complex... sorry, but the blog post was focused on 
describing *what* happens when parsing, to explain why mixing prefix/infix is 
bad ... I avoided getting bogged down in *why* it happens exactly the 
way it does.


To get to the "why" you have to circle back to the higher level concept 
that the "prefix" operators very closely align to the underlying concepts 
of the BooleanQuery/BooleanClause data structures: that each clause has an 
"Occur" property which is either: MUST/SHOULD/MUST_NOT (or FILTER, but 
setting asside scoring that's functionally equivilent to MUST).

The 'infix' operators just manipulate the Occur property of the clauses on 
either side of them.

'q.op=AND' and 'q.op=OR' are functionally really about setting the 
"Default Occur Value For All Clauses That Do Not Have An Explicit Occur 
Value" (ie: q.op=Occur.MUST and q.op=Occur.SHOULD) ... where the explicit 
Occur value for each clause would be specified by its prefix (+=MUST, 
-=MUST_NOT ... there is no supported prefix for SHOULD, which is why 
q.op=SHOULD is the default and changing it complicates the parser logic)

In essence: after the q.op/default.occur is applied to all clauses (that 
don't already have a prefix), there is a left-to-right parse that 
lets the infix operators modify the "Occur" values of the clauses on 
either side of them -- if those Occur values match the "default" for this 
parser.

So let's imagine 2 requests...

1)  {!q.op=AND}a +b OR c +d AND e
2)  {!q.op=OR} x +y OR z +r AND s

Here's what those wind up looking like internally with the default 
applied...

1) q.op=MUST:MUST(a)   MUST(b) OR MUST(c)   MUST(d) AND MUST(e)
2) q.op=SHOULD:  SHOULD(x) MUST(y) OR SHOULD(z) MUST(r) AND SHOULD(s)

And here's how the infix operators change things as it parses left to 
right building up the clauses...

1) q.op=MUST:MUST(a)   SHOULD(b) SHOULD(c) MUST(d)  MUST(e)
2) q.op=SHOULD:  SHOULD(x) MUST(y)   SHOULD(z) MUST(r)  MUST(s)

It's not actually done in "two passes" -- it's just that as the parsing 
is done left to right, the default Occur is used unless/until set by a 
prefix operators, and infix operators not only set the occur value 
for the "next" clause, but also reach back to override the prior 
Occur value if it matches the Default: because there is no "history" kept 
to indicate that it was explicitly set, or how.  the left to right parsing 
just does the best it can with the context it's got.

:  A little background - I am trying to implement a google search like
: service and want to have the ability to have required and prohibit
: operators while still allowing default intersection operation as default
: operator. How can I achieve this with this limitation?

If you want "intersection" to be the defualt, i'm not sure why you care 
about having a "required" operator? (you didn't mention anything about an 
"optional" operator even though your original example explicitly used 
"OR" ... so not really sure if that was just a contrived example or if you 
actaully care about supporting it?

If you're not hung up on using a specific syntax, you might want to 
consider the "simple" QParser -- it unfortunately re-uses the 'q.op=AND' 
param syntax to indicate what the default Occur should be for clauses, but 
the overall syntax is much simpler: there is a prefix negation operator, 
but otherwise the infix "+" and "|" operators support boolean AND and OR 
-- there are no prefix operators for MUST/SHOULD.  You can also turn off 
individual operators you don't like...

https://lucene.apache.org/solr/guide/8_5/other-parsers.html#OtherParsers-SimpleQueryParser
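
A quick hedged example of the simple parser with AND as the default 
(collection/field names are made up; '-' negates, and the q.operators param 
lets you disable operators you don't want exposed):

  $ curl http://localhost:8983/solr/mycollection/select \
      --data-urlencode 'q=ipod -belkin' -d 'defType=simple' -d 'q.op=AND' -d 'df=name'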


-Hoss
http://www.lucidworks.com/


Re: match string fields with embedded hyphens

2020-04-03 Thread Chris Hostetter


: I am working with a customer who needs to be able to query various 
: account/customer ID fields which may or may not have embedded dashes.  
: But they want to be able to search by entering the dashes or not and by 
: entering partial values or not.
: 
: So we may have an account or customer ID like
: 
: 1234-56AB45
: 
: And they would like to retrieve this by searching for any of the following:
: 1234-56AB45 (full string match)
: 1234-56(partial string match)
: 123456AB45(full string but no dashes)
: 123456  (partial string no dashes)

To answer your last question first...

: So perhaps I will just ask - how would you define a fieldType which 
: should ignore special characters like hyphens or underscores (or 
: anything non-alphanumeric) and works for full string or partial string 
: search?

This is pretty much exactly what the "Word Delimiter Filter" was designed 
for, and I encourage you to play with it and its various options and 
see what happens...

https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#word-delimiter-graph-filter

You'll definitely need to enable some "non-default" options (like 
"catenateNumbers=true") to ensure that you'd get indexed terms like 
"123456" from input "1234-56AB45".

One thing that's not entirely clear from your question & input is how you 
define "partial string" ... for example: are you expecting a query of "12" 
to match your input document? Because WDF won't help with that.

: But the behavior I see is completely unexpected. Full string match works 
: fine on the customer's DEV environment but not in QA (which is running 
: the same version of SOLR)

I guarantee you there is some difference between your DEV and QA 
environments.  Either in terms of the documents in the index, or the 
schema THAT WAS USED WHEN INDEXING THE DOCS --
which might have been changed after the indexing happened, or 
the "current" schema being used when the queries are getting 
parsed, or the default request options in solrconfig.xml ... something is 
absolutely different.

: Partial string match works for some ID fields but not others
: A Partial string match when the user does not enter the dashes just never 
works

I'm assuming these last 2 comments refer to behavior you see on *both* 
your DEV and QA instances?

Depending on your definition of "partial string" (see the question I asked 
above), I _think_ the analyzer you have should work -- at least for 
all the examples you've provided.

The missing piece of information is *how* you are querying: what query 
parser you are using, what exactly the input looks like; and also the 
output: what does "never works" mean? ... does it match 0 docs? does it 
match docs you don't expect?

Seeing the exact request URLs you are trying, with 
"debug=true&echoParams=all" added, and the full output of those requests, 
would help, so we can see things like the header where we can confirm what 
default params might be getting added, the query parser debug info to 
double check how your query is being parsed, and the "explain" info to see 
why the docs that are matching (unexpectedly) are there.

More tips on details that can be useful to include to "help us help 
you"...

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

-Hoss
http://www.lucidworks.com/


Re: Inconsistent / confusing documentation on indexing nested documents.

2020-04-03 Thread Chris Hostetter


: Is the documentation wrong or have I misunderstood it?

The documentation is definitely wrong, thanks for pointing this out...

https://issues.apache.org/jira/browse/SOLR-14383


-Hoss
http://www.lucidworks.com/


Re: Required operator (+) is being ignored when using default conjunction operator AND

2020-04-01 Thread Chris Hostetter


: Using solr 8.3.0 it seems like required operator isn't functioning properly
: when default conjunction operator is AND.

You're mixing the "prefix operators" with the "infix operators" which is 
always a recipe for disaster.  

The use of q.op=AND vs q.op=OR in these examples only 
complicates the issue, because q.op isn't really overriding any sort of implicit 
"infix operator" when clauses exist w/o an infix operator between them; it 
is overriding the implicit MUST/SHOULD/MUST_NOT given to each clause as 
parsed ... but in general setting q.op=AND really only makes sense when 
you expect/intend to only be using "infix operators"

This write-up I did several years ago is still very accurate -- the bottom 
line is you REALLY don't want to mix infix and prefix operators..

https://lucidworks.com/post/why-not-and-or-and-not/

...because the results of mixing them really only "make sense" given the 
context that the parser goes left to right (ie: no precedence) and has 
no explicit "prefix" operator syntax for "SHOULD"


-Hoss
http://www.lucidworks.com/


Re: Custom update processor and race condition with concurrent requests

2020-03-04 Thread Chris Hostetter


: So, I thought it can be simplified by moving this state transitions and
: processing logic into Solr by writing a custom update processor. The idea
: occurred to me when I was thinking about Solr serializing multiple
: concurrent requests for a document on the leader replica. So, my thought
: process was if I am getting this serialization for free I can implement the
: entire processing inside Solr and a dumb client to push records to Solr
: would be sufficient. But, that's not working. Perhaps the point I missed is
: that even though this processing is moved inside Solr I still have a race
: condition because of time-of-check to time-of-update gap.

Correct.  Solr is (hand wavy) "locking" updates to documents by id on the 
leader node to ensure they are transactional, but that locking happens 
inside DistributedUpdateProcessor, other update processors don't run 
"inside" that lock.

: While writing this it just occurred to me that I'm running my custom update
: processor before DistributedProcessor. I'm committing the same XY crime
: again but if I run it after DistributedProcessor can this race condition be
: avoided?

No.  That would just introduce a whole new host of problems that are a 
much more involved conversation to get into (remember: the processors after 
DUP run on every replica, after the leader has already assigned a 
version and said this update should go through ... so now imagine what 
your error handling logic has to look like?)


Ultimately the goal that you're talking about really feels like "business 
logic that requires synchronizing/blocking updates", but you're trying to 
avoid writing a synchronized client to do that synchronization and error 
handling before forwarding those updates to solr.

I mean -- even with your explanation of your goal, there is a whole host 
of nuance / use case specific logic that has to go into "based on various 
conflicts it modifies the records for which update failed" -- and that 
logic seems like it would affect the locking: if you get a request that 
violates the legal state transition because of another request that 
(blocked it until it) just finished ... now what?  fail? apply some new 
rules?

this seems like logic you should really want in a "middle ware" layer that 
your clients talk to and sends docs to solr.

If you *REALLY* want to try and piggy back this logic into solr, then 
there is _one_ place I can think of where you can "hook in" to the logic 
DistributedUpdateProcessor does while "locking" an id on the leader, and 
that would be extending the AtomicUpdateDocumentMerger...

It's marked experimental, and I don't really understand the use cases 
for why it exists, and in order to customize this you would have to 
also subclass DistributedUpdateProcessorFactory to build your custom 
instance and pass it to the DistributedUpdateProcessor constructor, but 
then -- in theory -- you could intercept any document update *after* the 
RTG, and before it's written to the TLOG, and apply some business logic.

But I wouldn't recommend this ... "there be Dragons!"



-Hoss
http://www.lucidworks.com/


Re: Custom update processor and race condition with concurrent requests

2020-03-03 Thread Chris Hostetter

It sounds like fundamentally the problem you have is that you want solr to 
"block" all updates to docId=X ... at the update processor chain level ... 
until an existing update is done.

but solr has no way to know that you want to block at that level.

ie: you asked...

: In the case of multiple concurrent instances of the update processor
: are RealTimeGetComponent.getInputDocument()
: calls serialzed?

...but the answer to that question isn't really relevant, because 
regardless of the answer, there is no guarantee at the java thread 
scheduling level that the operations your custom code performs on the 
results will happen in any particular order -- even if 
RealTimeGetComponent.getInputDocument(42) were to block other concurrent 
calls to RealTimeGetComponent.getInputDocument(42), that wouldn't ensure 
that the custom code you have in Thread1 that calls that method will 
finish its modifications to the SolrInputDocument *before* the same 
custom code in Thread2 calls RealTimeGetComponent.getInputDocument(42).

The only way to do something like this would be to add locking in your 
custom code itself -- based on the uniqueKey of the document -- to say 
"don't allow another thread to modify this document until I'm done", and 
keep that lock held until the delegated processAdd call finishes (so you 
know that the other update processors, including RunUpdateProcessor, have 
finished) ... but that would only work (easily) in a single node 
situation.  In a multinode situation you'd have to first check the state of 
the request and ensure that your processor (and its locking logic) only 
happens on the "leader" for that document, and deal with things at a 
distributed level ... and you've got a whole host of new headaches.
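
Purely as an illustration of that single-node approach (class name and 
details invented; not cluster-safe, and the lock map below is never cleaned 
up), a custom processor could look roughly like:

  import java.io.IOException;
  import java.util.concurrent.ConcurrentHashMap;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class PerDocLockingProcessor extends UpdateRequestProcessor {
    private static final ConcurrentHashMap<String,Object> LOCKS = new ConcurrentHashMap<>();

    public PerDocLockingProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      Object lock = LOCKS.computeIfAbsent(cmd.getPrintableId(), k -> new Object());
      synchronized (lock) {
        // hold the per-id lock until the rest of the chain (including
        // RunUpdateProcessor) has finished with this document
        super.processAdd(cmd);
      }
    }
  }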

I would really suggest you take a step back and re-think your objective, 
and share with us the "end goal" you're trying to achieve with this custom 
update processor, because it seems you may have headed down an 
unnecessarily complex route.  

what exactly is it you're trying to achieve?

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341





: Date: Tue, 3 Mar 2020 23:52:38 +0530
: From: Sachin Divekar 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Custom update processor and race condition with concurrent
: requests
: 
: Thank, Erick.
: 
: I think I was not clear enough. With the custom update processor, I'm not
: using optimistic concurrency at all. The update processor just modifies the
: incoming document with updated field values and atomic update instructions.
: It then forwards the modified request further in the chain. So, just to be
: clear in this test setup optimistic concurrency is not in the picture.
: 
: However, it looks like if I want to run concurrent update requests I will
: have to use optimistic concurrency, be it in update processor or in the
: client. I was wondering if I can avoid that by serializing requests at the
: update processor level.
: 
: > Hmmm, _where_ is your custom update processor running? And is this
: SolrCloud?
: Currently, it's a single node Solr but eventually, it will be SolrCloud. I
: am just testing the idea of doing something like this. Right now I am
: running the custom update processor before DistributedProcessor in the
: chain.
: 
: > If you run it _after_ the update is distributed (i.e. insure it’ll run on
: the leader) _and_ you can insure that your custom update processor is smart
: enough to know which version of the document is the “right” one, I should
: think you can get this to work.
: I think that's the exact problem. My update processor fetches the document,
: updates the request object and forwards it in the chain. The two concurrent
: instances (S1 and S2) of the update processor can fetch the document, get
: value 'x' of field 'f1' at the same time and process them whereas ideally,
: S2 should see the value updated by S1.
: 
: S1: fetches id1 -> gets f1: x -> sets f1: y -> Solr append it to tlog
: S2: fetches id1 -> gets f1: x .. ideally it should get 'y'
: 
: Is that possible with UpdateProcessor? I am using realtimeget (
: RealTimeGetComponent.getInputDocument()) in the update processor to fetch
: the document.
: 
: > You’ll have to use “real time get”, which fetches the most current
: version of the document even if it hasn’t been committed and reject the
: update if it’s too old. Anything in this path requires that the desired
: update doesn’t depend on the value already having been changed by the first
: update...
: 
: In the case of multiple concurrent instances of the update processor
: are 

Re: Why does Solr sort on _docid_ with rows=0 ?

2020-03-02 Thread Chris Hostetter
: docid is the natural order of the posting lists, so there is no sorting 
effort.
: I expect that means “don’t sort”.

basically yes, as documented in the comment right above the lines of code 
linked to.

: > So no one knows this then?
: > It seems like a good opportunity to get some performance!

The variable name is really stupid, but the 'solrQuery' variable you see 
in the code is *only* ever used for 'checkAZombieServer()' ... which 
should only be called when a server hasn't been responding to other (user 
initiated) requests

: >> I see a lot of such queries in my Solr 7.6.0 logs:

If you are seeing a lot of those queries, then there are other problems in 
your cluster you should investigate -- that's when/why LBSolrClient does 
this query -- to see if the server is responding.

: >> *path=/select
: >> params={q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2}
: >> hits=287128180 status=0 QTime=7173*

that is an abnormally large number of documents to have in a single shard.

: >> If you want to check a zombie server, shouldn't there be a much less
: >> expensive way to do a health-check instead?

Probably yes -- I've opened SOLR-14298...

https://issues.apache.org/jira/browse/SOLR-14298



-Hoss
http://www.lucidworks.com/

Re: Bug? Documents not visible after sucessful commit - chaos testing

2020-02-13 Thread Chris Hostetter


: We think this is a bug (silently dropping commits even if the client
: requested "waitForSearcher"), or at least a missing feature (commits beging
: the only UpdateRequests not reporting the achieved RF), which should be
: worth a JIRA Ticket.

Thanks for your analysis Michael -- I agree something better should be 
done here, and have filed SOLR-14262 for subsequent discussion...

https://issues.apache.org/jira/browse/SOLR-14262

I believe the reason the local commit is ignored during replay is to 
ensure a consistent view of the index -- if the tlog being 
replayed contains COMMIT1,A,B,C,COMMIT2,D,... we should never open a new 
searcher containing just A or just A+B w/o C if a COMMIT3 comes along 
during replay -- but agree with you 100% that either commit should support 
'rf' making it obvious that this commit didn't succeed (which would also 
be important & helpful if the node was still down when the client sends 
the commit) ... *AND* ... we should consider making the commit block until 
replay is finished.

...BUT... there are probably other nuances I don't understand ... 
hopefully other folks more familiar with the current implementation will 
chime in on the jira.




-Hoss
http://www.lucidworks.com/


Re: Bug? Documents not visible after sucessful commit - chaos testing

2020-02-05 Thread Chris Hostetter


I may be misunderstanding something in your setup, and/or I may be 
misremembering things about Solr, but I think the behavior you are 
seeing is because *search* in solr is "eventually consistent" -- while 
"RTG" (ie: using the "/get" handler) is (IIRC) "strongly consistent"

ie: there's a reason it's called "Near Real Time Searching" and "NRT 
Replica" ... not "RT Replica"

When you kill a node hosting a replica, then send an update which a leader 
accepts but can't send to that replica, that replica is now "out of sync" 
and will continue to be out of sync when it comes back online and starts 
responding to search requests as it recovers from the leader/tlog -- 
eventually the search will have consistent results across all replicas, 
but during the recovery period this isn't guaranteed.

If however you use the /get request handler, then it (again, IIRC) 
consults the tlog for the latest version of the doc even if it's 
mid-recovery and the index itself isn't yet up to date.

So for the purposes of testing solr as a "strongly consistent" document 
store, using /get?id=foo to check the "current" data in the document is 
more appropriate than /select?q=id:foo

Some more info here...

https://lucene.apache.org/solr/guide/8_4/solrcloud-resilience.html
https://lucene.apache.org/solr/guide/8_4/realtime-get.html


A few other things that jumped out at me in your email that seemed weird 
or worthy of comment...

: Accordung to solrs documentation, a commit with openSearcher=true and
: waitSearcher=true and waitFlush=true only returns once everything is
: presisted AND the new searcher is visible.
: 
: To me this sounds like that any subsequent request after a successful
: commit MUST hit the new searcher and is guaranteed to see the commit
: changes, regardless of node failures or restarts.

that is true for *single* node solr, or a "healthy" cluster, but as I 
mentioned, if a node is down when the "commit" happens it won't have the 
document yet -- nor is it alive to process the commit.  The document 
update -- and the commit -- are in the tlog that still needs to replay 
when the replica comes back online

:- A test-collection with 1 Shard and 2 NRT Replicas.

I'm guessing since you said you were using 3 nodes, that what you 
mean here is a single shard with a total of 3 replicas which are all NRT 
-- remember the "leader" is still itself an NRT  replica.  

(i know, i know ... i hate the terminology) 

This is a really important point to clarify in your testing because of how 
you are using 'rf' ... seeing exactly how you create your collection is 
important to make sure we're talking about the same thing.

: Each "transaction" adds, modifys and deletes documents and we ensure that
: each response has a "rf=2" (achieved replication factor=2) attribute.

So to be clear: 'rf=2' means a total of 2 replicas confirmed the update -- 
that includes the leader replica.  'rf=1' means the leader accepted the 
doc, but all other replicas are down.

if you want to be 100% certain that every replica received the update, 
then you should be confirming rf=3

: After a "transaction" was performed without errors we send first a
: hardCommit and then a softCommit, both with waitFlush=true,
: waitSearcher=true and ensure they both return without errors.

FYI: there is no need to send a softCommit after a hardCommit -- a hard 
commit with openSearcher=true (the default) is a super-set of a soft 
commit.



-Hoss
http://www.lucidworks.com/


Re: Oracle OpenJDK to Amazon Corretto OpenJDK

2020-01-31 Thread Chris Hostetter


: Link to the issue was helpful.
: 
: Although, when I take a look at Dockerfile for any Solr version from here
: https://github.com/docker-solr/docker-solr, the very first line says
: FROM openjdk...It
: does not say FROM adoptopenjdk. Am I missing something?

Ahhh ... I have no idea, but at least now I better understand your 
concern.

I would suggest opening an issue / PR in the github:docker-solr repo ... 
there are plans to eventually officially move management of docker-solr 
into the Apache Lucene/Solr project, but for now it's an independent 
packaging effort...

https://github.com/docker-solr/docker-solr/
https://github.com/docker-solr/docker-solr/issues/276

...in the meantime: If you can't use openjdk, then as far as I understand 
how docker images work, you'd need to build your own using a patched 
Dockerfile.



-Hoss
http://www.lucidworks.com/


Re: How do I send multiple user version parameter value for a delet by id request with multiple IDs ?

2020-01-31 Thread Chris Hostetter


: Subject: How do I send multiple user version parameter value for a delet by id
:  request with multiple IDs ?

If you're talking about Solr's normal optimistic concurrency version 
constraints then you just pass '_version_' with each delete block...

https://lucene.apache.org/solr/guide/8_4/uploading-data-with-index-handlers.html#sending-json-update-commands

{"delete":[{"id":"51", "_version_":123445},
   {"id":"5", "_version_":67890}]}

...but it sounds like maybe you're talking about using the 
DocBasedVersionConstraintsProcessorFactory ? ... in which case I don't 
think what you're asking about is possible ... I'm pretty sure its 
deletion logic assumes you'll send only one delete at a time.

As a workaround, you can send the "tombstone" documents yourself (instead 
of relying on DocBasedVersionConstraintsProcessorFactory to intercept the 
deleteById commands and convert then into tombstones for you.)


-Hoss
http://www.lucidworks.com/


Re: Oracle OpenJDK to Amazon Corretto OpenJDK

2020-01-31 Thread Chris Hostetter

Just upgrade?

This has been fixed in most recent versions of AdoptOpenJDK builds...
https://github.com/AdoptOpenJDK/openjdk-build/issues/465

hossman@slate:~$ java8
hossman@slate:~$ java -XshowSettings:properties -version 2>&1 | grep -e 
vendor -e version
java.class.version = 52.0
java.runtime.version = 1.8.0_222-b10
java.specification.vendor = Oracle Corporation
java.specification.version = 1.8
java.vendor = AdoptOpenJDK
java.vendor.url = http://java.oracle.com/
java.vendor.url.bug = http://bugreport.sun.com/bugreport/
java.version = 1.8.0_222
java.vm.specification.vendor = Oracle Corporation
java.vm.specification.version = 1.8
java.vm.vendor = AdoptOpenJDK
java.vm.version = 25.222-b10
os.version = 5.0.0-32-generic
openjdk version "1.8.0_222"


hossman@slate:~$ java11
hossman@slate:~$ java -XshowSettings:properties -version 2>&1 | grep -e 
vendor -e version
java.class.version = 55.0
java.runtime.version = 11.0.4+11
java.specification.vendor = Oracle Corporation
java.specification.version = 11
java.vendor = AdoptOpenJDK
java.vendor.url = https://adoptopenjdk.net/
java.vendor.url.bug = https://github.com/AdoptOpenJDK/openjdk-build/issues
java.vendor.version = AdoptOpenJDK
java.version = 11.0.4
java.version.date = 2019-07-16
java.vm.specification.vendor = Oracle Corporation
java.vm.specification.version = 11
java.vm.vendor = AdoptOpenJDK
java.vm.version = 11.0.4+11
os.version = 5.0.0-32-generic
openjdk version "11.0.4" 2019-07-16




: Date: Fri, 31 Jan 2020 12:45:36 -0500
: From: Arnold Bronley 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Oracle OpenJDK to Amazon Corretto OpenJDK
: 
: Thanks for the helpful information. It is a no-go because even though it is
: OpenJDK and free, vendor is Oracle and legal dept. at our company is trying
: to get away from anything Oracle.
: It is little paranoid reaction, I agree.
: 
: See the java.vendor property in following output.
: 
: $ java -XshowSettings:properties -version
: Property settings:
: awt.toolkit = sun.awt.X11.XToolkit
: file.encoding = UTF-8
: file.encoding.pkg = sun.io
: file.separator = /
: java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
: java.awt.printerjob = sun.print.PSPrinterJob
: java.class.path = .
: java.class.version = 52.0
: java.endorsed.dirs = /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/endorsed
: java.ext.dirs = /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext
: /usr/java/packages/lib/ext
: java.home = /usr/lib/jvm/java-8-openjdk-amd64/jre
: java.io.tmpdir = /tmp
: java.library.path = /usr/java/packages/lib/amd64
: /usr/lib/x86_64-linux-gnu/jni
: /lib/x86_64-linux-gnu
: /usr/lib/x86_64-linux-gnu
: /usr/lib/jni
: /lib
: /usr/lib
: java.runtime.name = OpenJDK Runtime Environment
: java.runtime.version = 1.8.0_181-8u181-b13-1~deb9u1-b13
: java.specification.name = Java Platform API Specification
: java.specification.vendor = Oracle Corporation
: java.specification.version = 1.8
: java.vendor = Oracle Corporation
: java.vendor.url = http://java.oracle.com/
: java.vendor.url.bug = http://bugreport.sun.com/bugreport/
: java.version = 1.8.0_181
: java.vm.info = mixed mode
: java.vm.name = OpenJDK 64-Bit Server VM
: java.vm.specification.name = Java Virtual Machine Specification
: java.vm.specification.vendor = Oracle Corporation
: java.vm.specification.version = 1.8
: java.vm.vendor = Oracle Corporation
: java.vm.version = 25.181-b13
: line.separator = \n
: os.arch = amd64
: os.name = Linux
: os.version = 4.9.0-8-amd64
: path.separator = :
: sun.arch.data.model = 64
: sun.boot.class.path =
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/resources.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/sunrsasign.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/charsets.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar
: /usr/lib/jvm/java-8-openjdk-amd64/jre/classes
: sun.boot.library.path = /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64
: sun.cpu.endian = little
: sun.cpu.isalist =
: sun.io.unicode.encoding = UnicodeLittle
: sun.java.launcher = SUN_STANDARD
: sun.jnu.encoding = UTF-8
: sun.management.compiler = HotSpot 64-Bit Tiered Compilers
: sun.os.patch.level = unknown
: user.country = US
: user.dir = /opt/solr
: user.home = /home/solr
: user.language = en
: user.name = solr
: user.timezone =
: 
: openjdk version "1.8.0_181"
: OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
: OpenJDK 64-Bit Server VM (build 

Re: Query Containing Multiple Parsers

2019-12-17 Thread Chris Hostetter


: Is there a way to construct a query that needs two different parsers?
: Example:
: q={!xmlparser}Hello
: AND
: q={!edismax}text_en:"foo bar"~4

The easiest way to do what you're asking about would be to choose one of 
those queries for "scoring" purposes, and put the other one in an "fq" 
simply for filtering.

But you can build a single complex query using multiple parsers by 
leveraging the "lucene" parser's support for nesting queries -- ie: in a 
larger boolean query -- and then use local param variables to reference 
your other param names...


q=({!edismax qf=text_en v=$my_main_query} OR {!xmlparser v=$my_span_query})
my_main_query="foo bar"~4
my_span_query=Hello

...the important bits that tend to trip people up are to make sure you 
don't start your query string with the local param syntax of another 
parser, and that you don't pass the input of your nested parsers "inline" 
if they contain whitespace ... hence the parens above and the use of the 
'v' local param.

If you tried to do the same thing like either of these queries below, it 
wouldn't work because it would confuse the parsing logic...

bad_q1={!edismax qf=text_en v=$my_main_query} OR {!xmlparser v=$my_span_query}

bad_q2=({!edismax qf=text_en}"foo bar"~4 OR {!xmlparser v=$my_span_query})

in "bad_q1" solr would think you wanted the *entire* param value 
(including the "OR {!xmlparser...") passed to the "edismax" parser

in "bad_q2" the nested edismax parser would only be given the input '"foo' 
... and not the ' bar"~4' bit, because the outermost (implicit) lucene 
parser doesn't understand how much of the input you intended for the 
nested parser.


-Hoss
http://www.lucidworks.com/


Re: Using Deep Paging with Graph Query Parser

2019-12-17 Thread Chris Hostetter


: Is there a way to use combine paging's cursor feature with graph query
: parser?

it should work just fine -- the cursorMark logic doesn't care what query 
parser you use.
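
For example (field names here are invented), a request like this should 
page through the graph results the same as any other query, feeding back 
the nextCursorMark value on each subsequent request:

  q={!graph from=parent_id to=id}id:root-42
  sort=score desc, id asc
  rows=100
  cursorMark=*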

Is there a particular problem you are running into when you send requests 
using both?


-Hoss
http://www.lucidworks.com/


Re: using scoring to find exact matches while using a cursormark

2019-11-18 Thread Chris Hostetter


: If I use the following query:in the browser, I get the expected results at
: the top of the returned values from Solr.
: 
: {
:   "responseHeader":{
: "status":0,
: "QTime":41,
: "params":{
:   "q":"( clt_ref_no:OWL-2924-8 ^2 OR contract_number:OWL-2924-8^2 )",
:   "indent":"on",
:   "fl":"clt_ref_no, score",
:   "rows":"1000"}},

...so w/o a sort param you're getting the default sort: score "desc" 
(descending)...

https://lucene.apache.org/solr/guide/8_3/common-query-parameters.html#CommonQueryParameters-ThesortParameter

"If the sort parameter is omitted, sorting is performed as though the 
parameter were set to score desc."

: If I add ihe sorting needed for cursor, my results change
: dramatically, and the exact matches are not at the top of the stack.
...
: {
:   "responseHeader":{
: "status":0,
: "QTime":80,
: "params":{
:   "q":"( clt_ref_no:OWL-2924-8 ^2 OR contract_number:OWL-2924-8^2 )",
:   "indent":"on",
:   "fl":"clt_ref_no, score",
:   "sort":"score asc, id asc",

...you've requested "score asc" (ascending), which means you're explicitly 
requesting the lowest scoring documents first.

Did you see some example somewhere suggesting that you needed to sort by 
"score asc"?

(there are only a few niche cases where sorting by "score asc" is ever 
moderately useful)
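
If you want the cursor to walk the results in the same "best match first" 
order you got without a sort param, you want:

  sort=score desc, id asc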


-Hoss
http://www.lucidworks.com/


Re: different results in numFound vs using the cursor

2019-11-12 Thread Chris Hostetter


: > whoa... that's not normal .. what *exactly* does the fieldType declaration
: > (with all analyzers) look like, and what does the  declaration
: > look like?
: >
: >
: 
: 
: 

NOTE: "text_general" != "text_gen_sort"

Assuming your "text_general" declaration looks like it does in the 
_default config set, then using that for uniqueKey or sorting is definitely 
not a good idea.

If you were *actually* using SortableTextField for your uniqueKeyField ... 
well, that should be ok to *sort* on, but i still wouldn't suggest using 
it as a uniqueKey field ... honestly not sure what behavior that might 
have with things like deleteById, etc...


: I am going to adjust my schema, re-index, and try again. See if that
: doesn't fix this problem. I didn't know that having the uniqueKey be a
: textField was a bad idea.

https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey

"The fieldType of uniqueKey must not be analyzed"

(hence my comment about "possible, but hard to get right" ... you can use 
something like the KeywordTokenizer, but at that point you might as well 
use StrField, except in some really esoteric special situations)



-Hoss
http://www.lucidworks.com/


Re: different results in numFound vs using the cursor

2019-11-12 Thread Chris Hostetter


: > a) What is the fieldType of the uniqueKey field in use?
: >
: 
: It is a textField

whoa... that's not normal ... what *exactly* does the fieldType declaration 
(with all analyzers) look like, and what does the <field/> declaration 
look like?

you should really never use TextField for a uniqueKey ... it's possible, 
but incredibly tricky to get "right".

Independent from that, "sorting" on a TextField doesn't always do what you 
might think (again: depending on the analysis in use)

With a cursorMark you have other factors to consider: I bet what's 
happening is that the post-analysis terms for your docs result in 
duplicate values, so the cursorMark is skipping all docs that have the 
same (post analysis) sort value ... this could also manifest itself in 
other weird ways, like trying to deleteById.

Step #1: switch to using a simple StrField for your uniqueKey field and 
see if that solves all your problems.
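
ie: something along these lines (assuming your uniqueKey field is 
"debtor_id"):

  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <field name="debtor_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <uniqueKey>debtor_id</uniqueKey>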


-Hoss
http://www.lucidworks.com/


Re: different results in numFound vs using the cursor

2019-11-11 Thread Chris Hostetter


Based on the info provided, it's hard to be certain, but reading between 
the lines, here are the assumptions I'm making...

1) your core name is "dbtr"
2) the uniqueId field for the "dbtr" core is "debtor_id"

..are those assumptions correct?

Two key pieces of information that don't seem to be inferable from the 
info you've provided:

a) What is the fieldType of the uniqueKey field in use?
b) How are you determining that "The numFound: 35008"?

...

You show the code that prints out "size of solrResults: 22006" but nothing 
in your code ever prints $numFound.  there is a snippet of code at the top 
of your perl logic that seems disconnected from the rest of the code which 
makes me think that before you do anything with a cursor you are already 
parsing some *other* query response to get $numFound that way...

: i am using this logic in perl:
: 
: my $decoded = decode_json( $solrResponse->{_content} );
: my $numFound = $decoded->{response}{numFound};
: 
: $cursor = "*";
: $prevCursor = '';
: 
: while ( $prevCursor ne $cursor )
: {
:   my $solrURI = "\"http://[SOLR URL]:8983/solr/";
:   $solrURI .= $fdat{core};
...

...what exactly does all the code *before* this look like? what is the 
request that you are using to get that initial '$solrResponse' that you 
are parsing to extract '$numFound'?  Are you sure it's exactly the same as 
the query whose cursor you are iterating over?

It looks like you are (also) extracting 'my $numFound = 
$decoded->{response}{numFound};' on every (cursor) request ... what do you 
get if you add this to your cursor loop...

   print STDERR "numFound = $numFound at '$cursor'\n";


...because unless documents are being added/deleted as you iterate over 
the cursor, the numFound value should be consistent on each request.


-Hoss
http://www.lucidworks.com/


Re: Printing NULL character in log files.

2019-11-11 Thread Chris Hostetter


: Some of the log files that Solr generated contain <0x00> (null characters)
: in log files (like below)

I don't know of any reason why solr would write any null bytes to the 
logs, and certainly not in either of the places mentioned in your examples 
(where it would be at the end of an otherwise "complete" log message).  If 
those null bytes are in fact being written by the Solr JVM they would have 
to have come from log4j.  (the Logger abstraction would ensure that if 
they came from Solr they would still have the date/time/level prefix, etc...)

A cursory bit of googling doesn't suggest any reason why log4j would write 
null bytes spuriously to the log files -- but it does suggest that some 
log rotation tools can cause this behavior.

Are you using the default Solr log4j log rotation, or some external tool?


: Does anyone have the same issue before?
: If anyone knows a way to fix this issue or a cause of this issue, could you
: please let me know?
: 
: Any clue will be very appreciated.
: 
: 
: [Example Log 1]
: 
: 2019-10-20 06:02:03.643 INFO  (coreCloseExecutor-140-thread-4) [
:  x:corename1] o.a.s.m.SolrMetricManager Closing metric reporters for
: registry=solr.core.corename,
: tag=4c16<0x00><0x00><0x00><0x00>...<0x00><0x00>00ff
: 2019-10-20 06:02:03.643 INFO  (coreCloseExecutor-140-thread-4) [
:   x:corename1] o.a.s.m.r.SolrJmxReporter Closing reporter
: [org.apache.solr.metrics.reporters.SolrJmxReporter@17281659: rootName =
: null, domain = solr.core.corename, service url = null, agent id = null] for
: registry solr.core.corename1/
: 
com.codahale.metrics.MetricRegistry@6c9f45cc<0x00><0x00><0x00><0x00>..(continue
: printing <0x00> untill the end of file.)
: 
: [Example Log 2]
: 
: 2019-10-27 06:02:02.891 INFO  (coreCloseExecutor-140-thread-17) [
: x:core2] o.a.s.m.r.SolrJmxReporter Closing reporter
: [org.apache.solr.metrics.reporters.SolrJmxReporter@35e76d2e: rootName =
: null, domain = solr.core.core2, service url = null, agent id = null] for
: registry solr.core.core2 / com.codahale.metrics.MetricRegistry@76be90f4
: 2019-10-27 06:02:02.891 INFO  (coreCloseExecutor-140-thread-26) [
: x:core3]<0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00>...<0x00><0x00>
: o.a.s.m.SolrMetricManager Closing metric reporters for
: registry=solr.core.TUN000, tag=34f04984
: 2019-10-27 06:02:02.891 INFO  (coreCloseExecutor-140-thread-26) [
: x:TUN000] o.a.s.m.r.SolrJmxReporter Closing reporter
: [org.apache.solr.metrics.reporters.SolrJmxReporter@378cecb: rootName =
: null, domain = solr.core.TUN000, service url = null, agent id = null] for
: registry solr.core.TUN000 / com.codahale.metrics.MetricRegistry@9c3410c
: 2019-10-27 06:02:05.063 INFO  (Thread-1) [   ] o.e.j.s.h.ContextHandler
: Stopped o.e.j.w.WebAppContext@5fbe4146
: 
{/solr,null,UNAVAILABLE}{file:///E:/apatchSolr/RCSS-basic-4.0.1/LUSOLR/solr/server//solr-webapp/webapp}
: <0x00><0x00><0x00><0x00><0x00><0x00>...(printing <0x00> until the end of
: the file)..<0x00><0x00>
: 
: 
: Sincerely,
: Kaya Ota
: 

-Hoss
http://www.lucidworks.com/


Re: Cursor mark page duplicates

2019-11-07 Thread Chris Hostetter


: I'm using Solr's cursor mark feature and noticing duplicates when paging 
: through results.  The duplicate records happen intermittently and appear 
: at the end of one page, and the beginning of the next (but not on all 
: pages through the results). So if rows=20 the duplicate records would be 
: document 20 on page1, and document 21 on page 2.  The document's id come 

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document 
you are seeing as it appears in *BOTH* pages of results -- along with the 
cursorMark value in use on both of those pages.


: (-MM-DD HH:MM.SS)), score. In this Solr community post 
: 
(https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
 
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular 
pagination, where even on a single core/replica you can see a document 
X get "pushed" from page#1 to page#2 by updates/additions of some other
document Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs 
forward" doesn't exist because of the cursorMark.  The only way a doc 
*should* move is if it's OWN sort values are updated, causing it to 
reposition itself.

But, if you have a static index, then it's *possible* that the last time 
your document X was updated, there was a "glitch" somewhere in the 
distributed update process, and the update didn't succeed in osme 
replicas -- so the same document may have different sort values 
on diff replicas.

: In the Solr query below for one of the example duplicates in question I 
: can see a search by the id returns only a single document. The 
: replication factor for the collection is 2 so the id will also appear in 
: this shards replica.  Taking into consideration Shawn's advice above, my 

If you've already identified a particular document where this has 
happened, then you can also verify/disprove my hypothesis by hitting each 
of the replicas that hosts this document with a request that looks like...

/solr/MyCollection_shard4_replica_n12/select?q=id:FOO&distrib=false
/solr/MyCollection_shard4_replica_n35/select?q=id:FOO&distrib=false

...and compare the results to see if all field values match


-Hoss
http://www.lucidworks.com/


Re: copyField - why source should contain * when dest contains *?

2019-10-23 Thread Chris Hostetter


: Documentation says that we can copy multiple fields using wildcard to one
: or more than one fields.

correct ... the limitation is in the syntax and the ambiguity that would 
be unresolvable if you had a wildcard in the dest but not in the source.  

the wildcard is essentially a variable.  if you have...

   source="foo" dest="*_bar"

...then solr has no idea what full field name to use as the destination 
when it sees values in a field "foo" ... should it be "1_bar"? 
"aaa_bar" ? ... "z_bar" ? all three?

: Yes, that's what hit me initially. But, "*_x" while indexing (in XMLs)
: doesn't mean anything, right? It's only used in dynamicFields while
: defining schema to let Solr know that we would have some undeclared fields

use of wildcards in copyField is not constrained to only 
using dynamicFields, this would be a perfectly valid copyField using 
wildcards, even if these are the only fields in the schema, and it had 
no dynamicFields at all...

  <field name="foo_x" type="text_general" indexed="true" stored="true"/>
  <field name="bar_x" type="text_general" indexed="true" stored="true"/>
  <field name="baz_x" type="text_general" indexed="true" stored="true"/>

  <field name="foo_y" type="text_general" indexed="true" stored="true"/>
  <field name="bar_y" type="text_general" indexed="true" stored="true"/>
  <field name="baz_y" type="text_general" indexed="true" stored="true"/>

  <copyField source="*_x" dest="*_y"/>

: having names like this. Also, according to the documentation, we can have
: dest="*_x" when source="*_x" if I'm right. In this case, there's support
: for multiple destinations when there are multiple source.

correct.  there is support for copying from one field to another 
via a *MAPPING* -- so a single copyField declaration can go from multiple 
sources to multiple destinations, but using a wildcard in the dest
only works with a one-to-one mapping when the wildcard also exists in the 
source.

on the flip side however, you can have a many-to-one mapping by using a 
wildcard *only* in the source

  <field name="foo_x" type="text_general" indexed="true" stored="true"/>
  <field name="bar_x" type="text_general" indexed="true" stored="true"/>
  <field name="baz_x" type="text_general" indexed="true" stored="true"/>

  <field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="*_x" dest="all_text"/>



-Hoss
http://www.lucidworks.com/


Re: ant precommit fails on .adoc files

2019-10-08 Thread Chris Hostetter


This is strange -- I can't reproduce, and I can't see any evidence of a 
change to explain why this might have been failing 8 days ago but not any 
more.

Are you still seeing this error?

The lines in question are XML comments inside of (example) code blocks (in 
the ref-guide source), which is valid, and the 
'checkForUnescapedSymbolSubstitutions' groovy function that generates the 
error below already has allowances for this possibility.

(normally putting '->' in asciidoctor source files is a bad idea and 
renders as gibberish, which is why we have this check)


I wonder if it's possible that something in the local ENV where you are 
running ant is causing the groovy regex patterns to be evaluated 
differently? (ie: mismatched unix/windows line endings, LANG that doesn't 
use UTF-8, etc...)




: I've checked out lucene-solr project, branch "branch_8x"

: When I run "ant precommit" at project root, I get these validation 
: errors on "analytics.adoc" file.  Has anyone seen these before, and if 
: you knew of a fix?

: validate-source-patterns:
: 
: [source-patterns] Unescaped symbol "->" on line #46: 
solr/solr-ref-guide/src/analytics.adoc
: 
: [source-patterns] Unescaped symbol "->" on line #55: 
solr/solr-ref-guide/src/analytics.adoc


-Hoss
http://www.lucidworks.com/


Re: Solrcloud export all results sorted by score

2019-10-03 Thread Chris Hostetter


: We show a table of search results ordered by score (relevancy) that was
: obtained from sending a query to the standard /select handler. We're
: working in the life-sciences domain and it is common for our result sets to
: contain many millions of results (unfortunately). After users browse their
: results, they then may want to download the results that they see, to do
: some post-processing. However, to do this, such that the results appear in
: the order that the user originally saw them, we'd need to be able to export
: results based on score/relevancy.

What's your UI & middle layer like for this application and 
eventual "download" ?

I'm going to presume your end user facing app is reading the data from 
Solr, buffering it locally while formatting it in some user selected 
export format, and then giving the user a download link?

In which case using a cursor, and making iterative requests to solr from 
your app should work just fine...

https://lucene.apache.org/solr/guide/8_0/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
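
For example, a rough SolrJ sketch of that iteration (collection name, 
uniqueKey field, and page size are made up for illustration):

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.CursorMarkParams;

  public class ExportByRelevancy {
    public static void export(SolrClient client, String userQuery) throws Exception {
      SolrQuery q = new SolrQuery(userQuery);
      q.setRows(1000);
      q.setSort(SolrQuery.SortClause.desc("score"));  // relevancy order...
      q.addSort(SolrQuery.SortClause.asc("id"));      // ...plus the uniqueKey tiebreaker cursors require
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query("mycollection", q);
        // ... append rsp.getResults() to the user's download file here ...
        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) break;               // cursor stopped moving: no more results
        cursor = next;
      }
    }
  }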

(The added benefit of cursors over /export is that it doesn't require doc 
values on every field you return ... which seems like something that you 
might care about if you have large (text) fields and an index growing as 
fast as you describe yours growing)


If you don't have any sort of middle layer application, and you're just 
providing a very thin (ie: javascript) based UI in front of solr, 
and need a way to stream a full result set from solr that you can give 
your end users raw direct access to ... then i think you're out of luck?


-Hoss
http://www.lucidworks.com/


Re: Lower case "or" is being treated as operator OR?

2019-08-07 Thread Chris Hostetter


: I think by "what query parser" you mean this:

no, that's the fieldType -- what I was referring to is that you are in fact 
using "edismax", but with solr 8.1 lowercaseOperators should default to 
"false", so my initial guess is probably wrong.

: By "request parameter" I think you are asking what I'm sending to Solr?  if
: sow I'm sending it the raw text of "or" or "OR".  In case you mean my
: request-handler, it is this:

I mean all of it -- including any other request params your client may be 
sending to solr that override those defaults you just posted.

the best thing to do to make sense of this is add 
"echoParams=all" and "debug=true" to your request, and show us the 
full response, along with some details of what docs in that result you 
don't expect to match, so we can look at:

1) what params come back in the responseHeader, so we can sanity check 
exactly what query string(s) are getting sent to solr, and that 
nothing is overriding lowercaseOperators, etc...

2) what comes back in the query debug section, so we can sanity check how 
your query strings are getting parsed

3) what the "explain" output looks like for those docs you are getting 
that you don't expect, so we can see why they matched.


FWIW: you mentioned "My default operator is AND" ... but that's not 
visible in the requestHandler defaults you posted -- so where is it being 
set?  (maybe it's not being set like you think it is?)



-Hoss
http://www.lucidworks.com/


Re: Lower case "or" is being treated as operator OR?

2019-08-07 Thread Chris Hostetter


what version of solr?
what query parser are you using?
what do all of your request params (including defaults) look like?

it's possible you are seeing the effects of edismax's "lowercaseOperators" 
param, which _should_ default to "false" in modern solr, but 
in very old versions it defaulted to "true" (in spite of what the docs at 
the time said)...

https://lucene.apache.org/solr/guide/8_1/the-extended-dismax-query-parser.html
https://issues.apache.org/jira/browse/SOLR-4646


: Date: Wed, 7 Aug 2019 19:32:02 -0400
: From: Steven White 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Lower case "or" is being treated as operator OR?
: 
: Hi everyone,
: 
: My schema is setup to index all words (no stop-words such as "or", "and",
: etc.) are removed.  My default operator is AND.  But when I search for "one
: or two" (without the quotes as this is not a phrase search) I'm getting
: hits on documents that have either "one" or "two".  It has the same effect
: as if I searched for "one OR two".  Any idea why?
: 
: Where should I look to see what's causing this issue?  What part of my
: schema or request handler do you need to see?
: 
: In case this helps.  Searching for just "or" or "OR" (with or without
: quests) gives me the same set of hits and ranking.  The same is also true
: for "and" or "AND".
: 
: Thanks.
: 
: Steven
: 

-Hoss
http://www.lucidworks.com/


Re: Not able reproduce race condition issue to justify implementation of optimistic concurrency

2018-11-16 Thread Chris Hostetter


1) depending on the number of CPUs / load on your solr server, it's 
possible you're just getting lucky. it's hard to "prove" with a 
multithreaded test that concurrency bugs exist.

2) a lot depends on what your updates look like (ie: the impl of 
SolrDocWriter.atomicWrite()), and what the field definitions look like.  

If you are in fact doing "atomic updates" (ie: sending a "set" command on 
the field) instead of sending the whole document *AND* if the fields f1 & 
f2 are fields that only use docValues (ie: not stored or indexed) then 
under the covers you're getting an "in-place" update in which (IIRC) it's 
totally safe for the 2 updates to happen concurrently to *DIFFERENT* 
fields of the same document.

Where you are almost certainly going to get into trouble, even if you are 
leveraging "in-place" updates under the hood, is if 2 diff threads try to 
update the *SAME* field -- even if the individual threads don't try to 
assert that the final count matches their expected count, you will likely 
wind up missing some updates (ie: the final value may not be equal to the sum 
of the total increments from both threads)

Other problems will exist in cases where in-place updates can't be used 
(ie: if you also updated a String field when incrementing your numeric 
counter)

The key thing to remember is that there is almost no overhead in using 
optimistic concurrency -- *UNLESS* you encounter a collision/failure.  If 
you are planning on having concurrent indexing clients reading docs from 
solr, modifying them, and writing back to solr -- and there is a change 
multiple client threads will touch the same document, then the slight 
addition of optimistic concurrency params to the updates & retrying on 
failure is a trivial addition to the client code, and shouldn't have a 
noticable impact on performance.
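
For example, the client just echoes back the _version_ it read with its 
atomic update (the values below are made up), and re-reads & retries if 
Solr answers with an HTTP 409 version conflict:

  {"id":"doc1",
   "f1":{"inc":1},
   "_version_":1617729846917136384}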



: Before implementing optimistic concurrency solution, I had written one test
: case to check if two threads atomically writing two different fields (say
: f1 and f2) of the same document (say d) run into conflict or not.
: Thread t1 atomically writes counter c1 to field f1 of document d, commits
: and then reads the value of f1 and makes sure that it is equal to c1. It
: then increments c1 by 1 and resumes until c1 reaches to say 1000.
: Thread t2 does the same, but with counter c2 and field f2 but with same
: document d.
: What I observed is the assertion of f1 = c1 or f2 = c2 in each loop never
: fails.
: I increased the max counter value to even 10 instead of mere 1000 and
: still no conflict
: I was under the impression that there would often be conflict and that is
: why I will require optimistic concurrency solution. How is this possible?
: Any idea?
: 
: Here is the test case code:
: 
: https://pastebin.com/KCLPYqeg
: 

-Hoss
http://www.lucidworks.com/


Re: ManagedIndexSchema Bad version when trying to persist schema

2018-10-29 Thread Chris Hostetter

:  Hi Erick,Thanks for your reply.No, we aren't using schemaless 
: mode.   is not explicitly declared in 
: our solrconfig.xmlAlso we have only one replica and one shard.

ManagedIndexSchemaFactory has been the default since 6.0 unless an 
explicit schemaFactory is defined...

https://lucene.apache.org/solr/guide/7_5/major-changes-from-solr-5-to-solr-6.html

https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html


-Hoss 

http://www.lucidworks.com/

Re: Does SolrJ support JSON DSL?

2018-10-05 Thread Chris Hostetter


: There's nothing out-of-the-box.

Which is to say: there are no explicit convenience methods for it, but you 
can absolutely use the JSON DSL and JSON facets via SolrJ and the 
QueryRequest -- just add the param key=value that you want, where the 
value is the JSON syntax...

ModifiableSolrParams p = new ModifiableSolrParams();
p.add("json.facet","{ ... }");
// and/or: p.add("json", "{ ... }");
QueryRequest req = new QueryRequest(p, SolrRequest.METHOD.POST);
QueryResponse rsp = req.process(client);

: On Fri, Oct 5, 2018 at 5:34 PM Alexandre Rafalovitch 
: wrote:
: 
: > Hi,
: >
: > Does anybody know if it is possible to do the new JSON DSL and JSON
: > Facets requests via SolrJ. The SolrJ documentation is a bit sparse and
: > I don't often use it. So, I can't figure out if there is a direct
: > support or even a pass-through workaround.
: >
: > Thank you,
: >Alex.
: >
: 
: 
: -- 
: Sincerely yours
: Mikhail Khludnev
: 

-Hoss
http://www.lucidworks.com/


Re: Atomic Update Failure With solr.UUID Field

2018-09-17 Thread Chris Hostetter


My suggestion:

* completely avoid using UUIDField
* use StrField instead
* use the UUIDUpdateProcessorFactory if you want solr to generate the 
UUIDs for you when adding a new doc.
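
A rough sketch of what that looks like in the schema / solrconfig.xml 
(chain and field names here are just placeholders):

  <field name="my_uuid" type="string" indexed="true" stored="true"/>

  <updateRequestProcessorChain name="add-uuid" default="true">
    <processor class="solr.UUIDUpdateProcessorFactory">
      <!-- only generates a value when the incoming doc doesn't already have one -->
      <str name="fieldName">my_uuid</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>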

The fact that UUIDField internally passes values around as java.util.UUID 
objects (and other classes like it that don't stick to java "primitive" 
values) is the source of a large amount of pain in various places of the 
code base, with almost no value add to end users.


: Date: Wed, 29 Aug 2018 11:11:58 -0700
: From: Stephen Lewis Bianamara 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Atomic Update Failure With solr.UUID Field
: 
: Hi All,
: 
: Just checking back in. Did anyone have a chance to take a look? Would love
: to get some help here. My design requires docs with many UUIDs which should
: not need to be updated each time and should be optimally performant for
: filters. So I think this bug is currently a hard blocker for me to be able
: to use SOLR :( Is anyone from the SOLR community able to assist? I've
: gathered some additional data in the mean time, and I would really
: appreciate someone familiar with the area taking a look.
: 
: Here are my additional discoveries
: 
:1. Turning on doc values and turning off stored, atomic updates work as
:they're supposed to with UUID
:2. Turning on doc values and turning on stored, atomic updates break as
:before with UUID. Thus it is 100% an effect of turning on stored.
:3. The error is being thrown here
:

:.
: 
: From the point that the error is thrown, I see a couple of possible options
: as to what the fix may be. However, I'm relatively new to the innards of
: the SOLR stack and only an occasional Java dev, so I'd love some guidance
: on the matter.
: 
: Perhaps the fix is to make java.Util.UUID implement BytesRef? Perhaps the
: fix is to add another bit of logic after the " if (o instanceof BytesRef) "
: conditional block. Something like, cast the object to a UUID and then
: serialize to a byte array?
: 
: Cheers,
: Stephen
: 
: On Wed, Aug 22, 2018 at 8:53 AM Stephen Lewis Bianamara <
: stephen.bianam...@gmail.com> wrote:
: 
: > Hello again! I found a thread which seems relevant. It looks like someone
: > else found this occurred as well, but did not follow up with repro steps.
: > But I did! :)
: >
: >
: > 
http://lucene.472066.n3.nabble.com/TransactionLog-doesn-t-know-how-to-serialize-class-java-util-UUID-try-implementing-ObjectResolver-td4332277.html
: >
: > Would love to work together to get this fixed.
: >
: > On Tue, Aug 21, 2018 at 6:50 PM Stephen Lewis Bianamara <
: > stephen.bianam...@gmail.com> wrote:
: >
: >> Hello SOLR Community,
: >>
: >> I'm prototyping a collection on SOLR 6.6.3 with UUID fields, and I'm
: >> hitting some trouble with atomic updates. At a high level, here's the
: >> problem: suppose you have a schema with an optional field of type solr.UUID
: >> field, and a document with a value for that field. Any atomic update on
: >> that document which does not contain the UUID field will fail. Below I
: >> provide an example and then an exact set of repro steps.
: >>
: >> So for example, suppose I have the following doc: {"Id":1,
: >> "SomeString":"woof", "MyUUID":"617c7768-7cc3-42d0-9ae1-74398bc5a3e7"}. If I
: >> run an atomic update on it like {"Id":1,"SomeString":{"set":"meow"}}, it
: >> will fail with message "TransactionLog doesn't know how to serialize class
: >> java.util.UUID; try implementing ObjectResolver?"
: >>
: >> Is this a known issue? Precise repro below. Thanks!
: >>
: >> Exact repro
: >> -
: >> 1. Define collection MyCollection with the following schema:
: >>
: >> 
: >>   
: >> 
: >> 
: >> 
: >> 
: >>   
: >>   Id
: >>   
: >> 
: >> 
: >>   
: >>
: >> 2. Create a document {"Id":1, "SomeString":"woof"} in the admin UI
: >> (MyCollection > Documents > /update). The update succeeds and the doc is
: >> searchable.
: >> 3. Apply the following atomic update. It succeeds. {"Id":1,
: >> "SomeString":{"set":"bark"}}
: >> 4. Add a value for MyUUID (either with atomic update or regular). It
: >> succeeds. {"Id":1,  
"MyUUID":{"set":"617c7768-7cc3-42d0-9ae1-74398bc5a3e7"}}
: >> 5. Try to atomically update just the SomeString field. It fails.
: >> {"Id":1,  "SomeString":{"set":"meow"}}
: >>
: >> The error that happens on failure is the following.
: >>
: >> Status: 
{"data":{"responseHeader":{"status":500,"QTime":2},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"TransactionLog
: >> doesn't know how to serialize class java.util.UUID; try implementing
: >> ObjectResolver?","trace":"org.apache.solr.common.SolrException:
: >> TransactionLog doesn't know how to serialize class java.util.UUID; try
: 

Re: Boost matches occurring early in the field (offset)

2018-09-17 Thread Chris Hostetter


: I have seen that one. But as I understand spanFirst, it only allows you 
: to define a boost if your span matches, i.e. not a gradually lower score 
: the further down in the document the match is?

I believe you are incorrect.

Unless something has drastically changed in SpanQuery in the past few 
years, all SpanQueries automatically "boost" the resulting scores of 
matching documents based on the "width" of the spans that match -- similar 
to how a phrase query with a high slop value will score higher for a doc 
with one "tight" match than on a doc with one "loose" match...

https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/Similarity.SimScorer.html

So in the specific case of SpanFirst -- any matching span is not 
only anchored (on the left) at the start of the field value, and (on the 
right) by at most the max term position value specified, but the closer the 
sub-span match is to the start of the field value, the smaller the 
resulting Span, and the higher the score.

(If this general relationship of Span "width" to score isn't clear from 
the high level jdocs, then it should probably be called out better? ... 
I'm not sure if it's particularly clear/obvious in the PhraseQuery jdocs 
either)
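
For example, at the Lucene level (field name and the position limit of 10 
are arbitrary):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanFirstQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  class SpanFirstExample {
    // matches docs where "solr" occurs within the first 10 positions of "body";
    // the closer the match is to position 0, the narrower the enclosing span
    // and (all else being equal) the higher the score
    static SpanQuery build() {
      SpanQuery inner = new SpanTermQuery(new Term("body", "solr"));
      return new SpanFirstQuery(inner, 10);
    }
  }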



-Hoss
http://www.lucidworks.com/


Re: Unable to enable SSL with self-sign certs

2018-09-12 Thread Chris Hostetter


: WARN: (main) AbstractLifeCycle FAILED org.eclipse.jetty.server.Server@...
: java.io.FileNotFoundException: /opt/solr-5.4.1/server (Is a directory)
: java.io.FileNotFoundException: /opt/solr-5.4.1/server (Is a directory)
: at java.io.FileInputStream.open0(Native Method)
: at java.io.FileInputStream.open(FileInputStream.java:195) 
: 
: The above jks is in the etc folder (/opt/solr-5.4.1/server/etc) and the
: permissions are 644. The settings in the /etc/default/solr.in.sh file are as
: follows:

What are the owner/group/perms of all the following...

/opt/solr-5.4.1/server/etc/solr-ssl.keystore.jks
/opt/solr-5.4.1/server/etc
/opt/solr-5.4.1/server
/opt/solr-5.4.1
/opt

...because my best guess is that a read perms issue on "solr-5.4.1" is 
preventing it from "finding" the server directory inside of it.



-Hoss
http://www.lucidworks.com/


Re: MetricsHistoryHandler getOverseerLeader fails when hostname contains hyphen

2018-07-25 Thread Chris Hostetter
: Subject: MetricsHistoryHandler getOverseerLeader fails when hostname contains
: hyphen

that's unfortunate.  I filed a jira...

https://issues.apache.org/jira/browse/SOLR-12594

: Can one just ignore this warning and what will happen then?

I think as long as you don't care about the metrics history reporting 
(which collects long term metrics to rollup and see changes over time) you 
can probably ignore that warning...

https://lucene.apache.org/solr/guide/7_4/metrics-history.html


-Hoss
http://www.lucidworks.com/


Re: cursorMark and sort order

2018-07-25 Thread Chris Hostetter


: For deep pagination, it is recommended that we use cursorMark and 
: provide a sort order for  as a tiebreaker.
: 
: I want my results in relevancy order and so have no sort specified on my 
query by default.
: 
: Do I need to explicitly set :
: 
:   sort : score desc,  asc

Yes.

: Or can I get away with just :
: 
:   sort =  asc
: 
: and have Solr understand that the sort is only for tie break purposes?

No, if you use the latter, solr will assume you don't care about scores at 
all.
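
For example, assuming "id" is your uniqueKey field, a complete cursor 
request would look something like (a sketch, not your exact params)...

  q=your query
  sort=score desc, id asc
  rows=100
  cursorMark=*

...and then each follow up request keeps q/sort/rows identical and passes 
the nextCursorMark returned by the previous response as the new cursorMark 
value.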



-Hoss
http://www.lucidworks.com/


Re: Possible to define a field so that substring-search is always used?

2018-07-24 Thread Chris Hostetter


: We are using Solr as a user index, and users have email addresses.
: 
: Our old search behavior used a SQL substring match for any search
: terms entered, and so users are used to being able to search for e.g.
: "chr" and finding my email address ("ch...@christopherschultz.net").
: 
: By default, Solr doesn't perform substring matches, and it might be
: difficult to re-train users to use *chr* to find email addresses by
: substring.

In the past, were you really doing arbitrary substring matching, or just 
prefix matching?  ie would a search for "sto" match 
"ch...@christopherschultz.net"

Personally, if you know you have an email field, I would suggest using a 
custom tokenizer that splits on "@" and "." (and maybe other punctuation 
characters like "-") and then taking your raw user input and feeding it to 
the prefix parser (instead of requiring your users to add the "*")...

 q={!prefix f=email v=$user_input}&user_input=chr

...which would match ch...@gmail.com, f...@chris.com, f...@bar.chr etc. 

(this wouldn't help you though if you *really* want arbitrary substring 
matching -- as erick suggested ngrams is pretty much your best bet for 
something like that)

Bear in mind, you can combine that "forced prefix" query against 
the (otkenized) email field with other queries that 
could parse your input in other ways...

user_input=...
q=({!prefix f=email v=$user_input} 
   OR {!dismax qf="first_name last_name" ..etc.. v=$user_input})

so if your user input is "chris" you'll get term matches on the 
first_name field, or the last_name field as well as prefix matches on the 
email field.
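
(For reference, a hedged sketch of what that kind of email field type might 
look like -- this exact definition is not from the thread, so tweak the 
pattern/filters to taste...)

  <fieldType name="email_parts" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[@.\-]"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="email" type="email_parts" indexed="true" stored="true"/>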



-Hoss
http://www.lucidworks.com/


Re: Alias field names when searching (not for results)

2018-07-24 Thread Chris Hostetter


: >  defType=edismax q=sysadmin name:Mike qf=title text last_name
: > first_name
: 
: Aside: I'm curious about the use of "qf", here. Since I didn't want my
: users to have to specify any particular field to search, I created an
: "all" field and dumped everything into it. It seems like it would be
: better to change that so that I don't have an "all" field at all and
...
: Does that sound like a better approach than packing-together an "all"
: field during indexing?

well -- you may have other reasons why an "all" field is useful, but yeah 
-- when using dismax/edismax the "qf" param is really designed to let you 
search across many diff fields, and to associate query time weights with 
those fields.  see the docs i linked to earlier, but there's also a blog 
post on the scoring implications i wrote a lifetime ago...

https://lucidworks.com/2010/05/23/whats-a-dismax/

: > ...the examples above all show the request params, so "f.last.qf"
: > is a param name, "last_name" is the corrisponding param value.
: 
: Awesome. I didn't realize that "f.alias.qf" was the name of the actual
: parameter to send. I was staring at the Solr Dashboard's selection of
: edismax parameters and not seeing anything that seemed correct. That's
: because it's a new parameter! Makes sense, now.

that syntax is an example of a "per field override" where in this case the 
"field" you are overriding doesn't *have* to be a "real" field in the 
index -- it can be an alias and for that alias (when used by your users) 
you are defining the qf to use.  it could in fact be a "real" field name, 
where you override what gets searched "I'm not going to let them search 
directly against just the last_name, when they try i'm going to *actually* 
search against last_name and full_name" etc...)


-Hoss
http://www.lucidworks.com/


Re: [EXTERNAL] Re: Facet Sorting

2018-07-24 Thread Chris Hostetter

: Chris, I was trying the below method for sorting the faceted buckets but 
: am seeing that the function query query($q) applies only to the score 
: from “q” parameter. My solr request has a combination of q, “bq” and 
: “bf” and it looks like the function query query($q) is calculating the 
: scores only on q and not on the aggregate score of q, bq and bf

right.  ok -- yeah, that makes sense.

The thing to understand is that when you use request params as "variables" 
in functions like that, the function doesn't know the context of your 
request -- "query($q)" doesn't know or treat the "q" param specially, it could 
just as easily be "query($make_up_a_param_name_thats_in_your_request)"

And when the query() function goes and evaluates the param you specify, 
it's not going to know that you have a defType of e/dismax that affects the 
"q" param when the main query is executed -- it just parses it as a lucene 
query.

so what you need is something like "query({!dismax bf=$bf bq=$bq v=$q})" 
... i think that should work, or if not then use "query($facet_sort)" 
where facet_sort is a new param you add that contains "{!dismax bf=$bf 
bq=$bq v=$q}"

alternatively, you could change your "q" param to be very explicit about 
the query you want, w/o depending on defType, and use a custom param name 
for the original query string provided by the user -- that's what i 
frequently do...

   ie: q={!dismax bf=$bf bq=$bq v=$qq}&qq=dogs and cats

...and then the "query($q)" i suggested before should work as is.

does that make sense?


-Hoss
http://www.lucidworks.com/

Re: Alias field names when searching (not for results)

2018-07-24 Thread Chris Hostetter


: So if I want to alias the "first_name" field to "first" and the
: "last_name" field to "last", then I would ... do what, exactly?

see the last example here...

https://lucene.apache.org/solr/guide/7_4/the-extended-dismax-query-parser.html#examples-of-edismax-queries

defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name

the "f.name.qf" has created an "alias" so that when the "q" contains 
"name:Mike" it searches for "Mike" in both the last_name and first_name 
fields.  if it were "f.name.qf=last_name first_name^2" then there would be 
a boost on matches in the first_name field.

For your usecase you want something like...

defType=edismax
q=sysadmin first:Mike last:Smith
qf=title text last_name first_name
f.first.qf=first_name
f.last.qf=last_name

: I'm using SolrJ as the client.

...the examples above all show the request params, so "f.last.qf" is a 
param name, "last_name" is the corrisponding param value.



-Hoss
http://www.lucidworks.com/


Re: Need an advice for architecture.

2018-07-19 Thread Chris Hostetter


: FWIW: I used the script below to build myself 3.8 million documents, with 
: 300 "text fields" consisting of anywhere from 1-10 "words" (integers 
: between 1 and 200)

Whoops ... forgot to post the script...


#!/usr/bin/perl

use strict;
use warnings;

my $num_docs = 3_800_000;
my $max_words_in_field = 10;
my $words_in_vocab = 200;
my $num_fields = 300;

# header
print "id";
map { print ",${_}_t" } 1..$num_fields;
print "\n";

while ($num_docs--) {
    print "$num_docs"; # uniqueKey
    for (1..$num_fields) {
        my $words_in_field = int(rand($max_words_in_field));
        print ",\"";
        map { print int(rand($words_in_vocab)) . " " } 0..$words_in_field;
        print "\"";
    }
    print "\n";
}




Re: Need an advice for architecture.

2018-07-19 Thread Chris Hostetter


: SQL DB 4M documents with up to 5000 metadata fields each document [2xXeon
: 2.1Ghz, 32GB RAM]
: Actual Solr: 1 Core version 4.6, 3.8M documents, schema has 300 metadata
: fields to import, size 3.6GB [2xXeon 2.4Ghz, 32GB RAM]
: (atm we need 35h to build the index and about 24h for a mass update which
: affects the production)

The first question i have is why you are using a version of Solr that's 
almost 5 years old.

The second question you should consider is what your indexing process 
looks like, and whether it's multithreaded or not, and if the bottleneck 
is your network/DB.

The third question to consider is your solr configuration / schema: how 
complex the solr side indexing process is -- ie: are these 300 fields all 
TextFields with complex analyzers?

FWIW: I used the script below to build myself 3.8 million documents, with 
300 "text fields" consisting of anywhere from 1-10 "words" (integers 
between 1 and 200)

The resulting CSV file was 24GB, and using a simple curl command to index 
with a single client thread (and a single solr thread) against the solr 
7.4 running with the sample techproducts configs took less than 2 hours on 
my laptop (less CPU & half as much ram compared to your server) while i 
was doing other stuff.
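
(That command was essentially nothing more exotic than the following -- 
hedged: the file name is made up, and you'd point it at your own 
collection...)

  curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
    -H 'Content-type: text/csv' --data-binary @generated_docs.csv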

(I would bet your current indexing speed has very little to do with solr 
and is largely a factor of your source DB and how you are sending the data 
to solr)


-Hoss
http://www.lucidworks.com/


Re: Document Count Difference Between Solr Versions 4.7 and 7.3

2018-07-19 Thread Chris Hostetter



: I performed a bulk reindex against one of our larger databases for the first
: time using solr 7.3. The document count was substantially less (like at
: least 15% less) than our most recent bulk reindex from th previous solr 4.7
: server. I will perform a more careful analysis, but I am assuming the
: document count should not be different against the same database, even
: accounting for the schema updates required for going from 4.7 to 7.3.

Was the exact same source data used in both cases?  ... you mentioned "most 
recent bulk reindex" but it's not clear if the source data changed since 
that last index job.

what does your bulk indexing code look like? does it log errors from solr?

were there any errors in the solr logs?


-Hoss
http://www.lucidworks.com/


Re: Facet Sorting

2018-07-18 Thread Chris Hostetter


: If I want to plug in my own sorting for facets, what would be the best 
: approach. I know, out of the box, solr supports sort by facet count and 
: sort by alpha. I want to plug in my own sorting (say by relevancy). Is 
: there a way to do that? Where should I start with if I need to write a 
: Custom Facet Component?

it sounds like you're talking about the "classic" facets (using 
"facet.field") where facet.sort only supports "count" (desc) and "index" 
(asc)

Adding a custom sort option there would be close to impossible.

The newer "json.facets" API supports a much more robust set of options, 
that includes the ability to sort on an "aggregate" function across all 
documents in the bucket...

https://lucene.apache.org/solr/guide/7_4/json-facet-api.html

some of the existing sort options there might solve your need, but it's 
also possible using that API to write your own ValueSourceParser plugin 
that can be used to sort facets as long as it returns ValueSources that 
extend "AggValueSource"

: Basically I want to plug the scores calculated in earlier steps for the 
: documents matched, do some kind of aggregation of the scores of the 
: documents that fall under a facet and use this aggregate score to rank 

IIUC what you want is possibly something like...

curl http://localhost:8983/solr/techproducts/query -d 'q=features:lcd&rows=0&
 json.facet={
   categories:{
     type : terms,
     field : cat,
     sort : { x : desc },
     facet:{
       x : "sum(query($q))",
     }
   }
 }
'

...which will sort the buckets by the sum of the scores of every document 
in that bucket (using the original query .. but you could alternatively 
sort by any aggregation of the scores from any arbitrary query / document 
based function)





-Hoss
http://www.lucidworks.com/


Re: SOLR 7.1 Top level element for similarity factory

2018-07-16 Thread Chris Hostetter


: So I have the following at the bottom of my schema.xml file
: 
: 
: 
: 
: 
: The documentation says "top level element" - so should that actually be 
outside the schema tag?

No, the schema tag is the "root" level element, it's direct children are 
the "top level elements"  

(the wording may not be the best possible, but the goal was to emphasis 
that to have the behavior you're looking need to make sure you don't just 
add it to a single fieldType, or mistakenly put it inside one the of the 
legacy  or  blocks)
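
i.e. something like this -- a hedged sketch, swap in whichever similarity 
factory you are actually configuring...

  <schema name="example" version="1.6">
    <!-- fields, fieldTypes, uniqueKey, copyFields, etc... -->

    <!-- global similarity: a direct child of <schema>, not nested in a fieldType -->
    <similarity class="solr.SchemaSimilarityFactory"/>
  </schema>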


-Hoss
http://www.lucidworks.com/


Re: CursorMarks and 'end of results'

2018-06-21 Thread Chris Hostetter


: the documentation of 'cursorMarks' recommends to fetch until a query returns
: the cursorMark that was passed in to a request.
: 
: But that always requires an additional request at the end, so I wonder if I
: can stop already, if a request returns less results than requested (num rows).
: There won't be new documents added during the search in my use case, so could
: there every be a non-empty 'page' after a non-full 'page'?

You could stop then -- if that fits your usecase -- but the documentation 
(in particular the sentence you are referring to) is trying to be as 
straightforward and general as possible ... which includes the use case 
where someone is "tailing" an index and documents may be continually 
added.

When originally writing those docs, I did have a bit in there about 
*either* getting back less than "rows" docs *or* getting back the same 
cursor you passed in (to try to cover both use cases as efficiently as 
possible) but it seemed more confusing -- and i was worried people might 
be surprised/confused when the number of docs was perfectly divisible by 
"rows" so the "less than rows" case could still wind up in a final 
request that returned "0" docs.

the current docs seemed like a good balance between brevity & clarity, 
with the added bonus of being correct :)

But as Anshum said: if you have suggested improvements for rewording, 
patches/PRs certainly welcome.  It's hard to have a good perspective on 
what docs are helpful to new users when you have been working with the 
software for 14 years and wrote the code in question.



-Hoss
http://www.lucidworks.com/


Re: Facet Range with Stats

2018-05-16 Thread Chris Hostetter

: I'd like to generate stats for the results of a facet range.
: For example, calculate the mean sold price over a range of months.
: Does anyone know how to do this?
: This Jira issue seems to indicate its not yet possible.
: [SOLR-6352] Let Stats Hang off of Range Facets - ASF JIRA

This is possible using the JSON Facet API...

https://lucene.apache.org/solr/guide/7_3/json-facet-api.html
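
For example, something along these lines -- a hedged sketch where the 
sold_date / price field names are made up, so adjust to your schema...

  json.facet={
    months : {
      type  : range,
      field : sold_date,
      start : "2018-01-01T00:00:00Z",
      end   : "2018-07-01T00:00:00Z",
      gap   : "+1MONTH",
      facet : {
        mean_price : "avg(price)"
      }
    }
  }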

This newer API syntax is more expressive and easier to add new syntax to, 
so it's unlikely you'll see more features like this added to the older 
faceting API -- and it will get less and less likely as time goes on.


-Hoss
http://www.lucidworks.com/


Re: versions of documentation: suggestion for improvement

2018-04-24 Thread Chris Hostetter

: I also noticed that there's the concept of "latest" (similar to "current"
: in postgres documentation) in solr. This is pretty cool. I am afraid
: though, that this currently is somewhat confusing. E.g., if I search for
: managed schema in google I get this as 1st url:
: 
: 
https://lucene.apache.org/solr/guide/6_6/schema-factory-definition-in-solrconfig.html
: 
: now if I try to replace 6_6 with latest:

i'm not sure where you got the impression that *replacing* a version 
number with "/latest" in the URL was designed to work ... the redirects 
are set up so that if you *remove* the version number, the rest of the path 
will redirect to the current version.

https://lucene.apache.org/solr/guide/schema-factory-definition-in-solrconfig.html


-Hoss
http://www.lucidworks.com/


Re: versions of documentation: suggestion for improvement

2018-04-23 Thread Chris Hostetter


There's been some discussion along the lines of doing some things like 
what you propose, which was spun out of discussion in SOLR-10595 into the 
issue LUCENE-7924 ... but so far no one has attempted the 
tooling/scripting work needed to make it happen.

Patches certainly welcome.



: Date: Mon, 23 Apr 2018 09:55:35 +0200
: From: Arturas Mazeika 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: versions of documentation: suggestion for improvement
: 
: Hi Solr-Team,
: 
: If I google for specific features for solr, I usually get redirected to 6.6
: version of the documentation, like this one:
: 
: 
https://lucene.apache.org/solr/guide/6_6/overview-of-documents-fields-and-schema-design.html
: 
: Since I am playing with 7.2 version of solr, I almost always need to change
: this manually through to:
: 
: 
https://lucene.apache.org/solr/guide/7_2/overview-of-documents-fields-and-schema-design.html
: 
: (by clicking on the url, going to the number, and replacing two
: characters). This is somewhat cumbersome (especially after the first dozen
: of changes in urls. Suggestion:
: 
: (1) Would it make sense to include other versions of the document as urls
: on the page? See, e.g., the following documentation of postgres, where each
: page has a pointer to the same page in different versions:
: 
: https://www.postgresql.org/docs/9.6/static/sql-createtable.html
: 
: (especially "This page in other versions: 9.3
:  / 9.4
:  / 9.5
:  / *9.6* /
: current
:  (10
: )" line on
: the page)
: 
: (2) Would it make sense in addition to include "current", pointing to the
: latest current release?
: 
: This would help to find solr relevant infos from search engines faster.
: 
: Cheers,
: Arturas
: 

-Hoss
http://www.lucidworks.com/


Re: Solr 6.6.2 Master/Slave SSL Replication Error

2018-04-22 Thread Chris Hostetter

You need to configure Solr to use a "truststore" that contains the 
certificate you want it to trust.  With a solr cloud setup, that usually 
involves configuring the "keystore" and the "truststore" to both contain 
the same keys...

https://lucene.apache.org/solr/guide/6_6/enabling-ssl.html
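
With the bin/solr scripts that usually boils down to pointing the 
SOLR_SSL_* variables in solr.in.sh at the same JKS on both the master and 
the slave -- a hedged sketch, use your real paths/passwords...

  SOLR_SSL_KEY_STORE=/opt/solr/server/etc/solr-ssl.keystore.jks
  SOLR_SSL_KEY_STORE_PASSWORD=secret
  SOLR_SSL_TRUST_STORE=/opt/solr/server/etc/solr-ssl.keystore.jks
  SOLR_SSL_TRUST_STORE_PASSWORD=secret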


: Date: Sat, 21 Apr 2018 14:40:08 -0700 (MST)
: From: kway 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Solr 6.6.2 Master/Slave SSL Replication Error
: 
: ... looking at this line, I am wondering if this is an issue because I am
: using a Self-Signed Certificate:
: 
: Caused by: javax.net.ssl.SSLHandshakeException:
: sun.security.validator.ValidatorException: PKIX path building failed:
: sun.security.provider.certpath.SunCertPathBuilderException: unable to find
: valid certification path to requested target
: 
: How would I get this to work with a self-signed cert?
: 
: Regards,
: 
: Kelly
: 
: 
: 
: --
: Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
: 

-Hoss
http://www.lucidworks.com/


Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-22 Thread Chris Hostetter

Not that i can think of -- the existing API is really designed with the 
focus that the ResponseWriter needs to specify the Content-Type prior 
to "writing" any bytes so Solr can do the best job possible "streaming" 
the data over the wire ... if you need to "pre-process" a lot of the 
response data before you know what you want to stream back, you have to do 
that in the getContentType() method .. but you can always store that 
pre-processed result in the SolrQueryRequest.getContext() and then write 
it out later.

(you could in theory do a ton of processing to prepare a big byte[] of all 
the data you want to send out as part of getContentType(), and then your 
write() method could be a single line of code)



: Date: Sun, 22 Apr 2018 15:49:30 +0100
: From: Lee Carroll <lee.a.carr...@googlemail.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: custom response writer which extends RawResponseWriter fails when
:  shards > 1
: 
: Hi,
: I've ended up processing the doclist in the response twice. Once in the
: write method and once in getContent. Its a bit inefficient but i'm only
: looking at top doc each time so probably ok.
: Is their a better way to do this ?
: 
: Cheers lee C
: 
: On 22 April 2018 at 13:26, Lee Carroll <lee.a.carr...@googlemail.com> wrote:
: 
: > Hi,
: > That works a treat. The raw response writer has a configurable base class
: > which executes when no content stream is present in the response so I just
: > delegate to that. I do still have an issue with writing content type on the
: > http response from a value in the top document however. Although I do have
: > a getContentType method which returns a value from a field in the top
: > document this is called before the response writer has performed its write
: > method.
: >
: > Whats the best way to set response headers using values from documents in
: > the search result? In particular Content-Type.
: >
: > Cheers Lee C
: >
: > On 20 April 2018 at 20:03, Chris Hostetter <hossman_luc...@fucit.org>
: > wrote:
: >
: >>
: >> Invariant really means "invariant" ... nothing can change them
: >>
: >> In the case of "wt" this may seem weird and unhelpful, but the code that
: >> handles defaults/appends/invariants is ignorant of what the params are.
: >>
: >> Since your writting custom code anyway, my suggestion would be that
: >> perhaps you could make your custom ResponseWriter delegate to the javabin
: >> responsewriter if/when you see that this is an "isShard=true" request?
: >>
: >>
: >>
: >> : Date: Thu, 19 Apr 2018 18:42:58 +0100
: >> : From: Lee Carroll <lee.a.carr...@googlemail.com>
: >> : Reply-To: solr-user@lucene.apache.org
: >> : To: solr-user@lucene.apache.org
: >> : Subject: Re: custom response writer which extends RawResponseWriter
: >> fails when
: >> :  shards > 1
: >> :
: >> : Hi,
: >> :
: >> : I rewrote all of my tests to use SolrCloudTestCase rather than
: >> SolrTestCaseJ4
: >> : and was able to replicate the responsewriter issue and debug with a
: >> sharded
: >> : collection. It turned out the issue was not with my response writer
: >> really
: >> : but rather my config.
: >> :
: >> : 
: >> : 
: >> :
: >> : 
: >> : content
: >> : 
: >> :
: >> : 
: >> :
: >> : In cloud mode having wt as an invariant breaks the collation of results
: >> : from shards. Now I'm sure this is a common mistake which I've repeated
: >> : (blush) but I do sort of want to actually implement my request handler
: >> in
: >> : this way. Is their a way to have a request handler support a single
: >> : response writer but still work in cloud mode ?
: >> :
: >> : Could this be considered a bug ?
: >> :
: >> : Lee C
: >> :
: >> : On 18 April 2018 at 13:13, Mikhail Khludnev <m...@apache.org> wrote:
: >> :
: >> : > Injecting headers might require deeper customisation up to
: >> establishing own
: >> : > filter or so.
: >> : > Speaking regarding your own WT, there might be some issues because
: >> usually
: >> : > it's not a big deal to use one wt for responding user query like
: >> (wt=csv)
: >> : > and wt=javabin in internal communication between aggregator and
: >> slaves like
: >> : > it happens in wt=csv query.
: >> : >
: >> : > On Wed, Apr 18, 2018 at 2:19 PM, Lee Carroll <
: >> lee.a.carr...@googlemail.com
: >> : > >
: >> : > wrote:
: >> : >
: >> : > > Inventive. I need to con

Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-20 Thread Chris Hostetter

Invariant really means "invariant" ... nothing can change them

In the case of "wt" this may seem weird and unhelpful, but the code that 
handles defaults/appends/invariants is ignorant of what the params are.

Since you're writing custom code anyway, my suggestion would be that 
perhaps you could make your custom ResponseWriter delegate to the javabin 
responsewriter if/when you see that this is an "isShard=true" request?



: Date: Thu, 19 Apr 2018 18:42:58 +0100
: From: Lee Carroll 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: custom response writer which extends RawResponseWriter fails when
:  shards > 1
: 
: Hi,
: 
: I rewrote all of my tests to use SolrCloudTestCase rather than SolrTestCaseJ4
: and was able to replicate the responsewriter issue and debug with a sharded
: collection. It turned out the issue was not with my response writer really
: but rather my config.
: 
: 
: 
: 
: 
: content
: 
: 
: 
: 
: In cloud mode having wt as an invariant breaks the collation of results
: from shards. Now I'm sure this is a common mistake which I've repeated
: (blush) but I do sort of want to actually implement my request handler in
: this way. Is their a way to have a request handler support a single
: response writer but still work in cloud mode ?
: 
: Could this be considered a bug ?
: 
: Lee C
: 
: On 18 April 2018 at 13:13, Mikhail Khludnev  wrote:
: 
: > Injecting headers might require deeper customisation up to establishing own
: > filter or so.
: > Speaking regarding your own WT, there might be some issues because usually
: > it's not a big deal to use one wt for responding user query like (wt=csv)
: > and wt=javabin in internal communication between aggregator and slaves like
: > it happens in wt=csv query.
: >
: > On Wed, Apr 18, 2018 at 2:19 PM, Lee Carroll  >
: > wrote:
: >
: > > Inventive. I need to control content-type of the response from the
: > document
: > > field value. I have the actual content field and the content-type field
: > to
: > > use configured in the response writer. I've just noticed that the xslt
: > > transformer allows you to do this but not controlled by document values.
: > I
: > > may also need to set some headers based on content-type and perhaps
: > content
: > > size, accept ranges comes to mind. Although I might be getting ahead of
: > > myself.
: > >
: > >
: > >
: > > On 18 April 2018 at 12:05, Mikhail Khludnev  wrote:
: > >
: > > > well ..
: > > > what if
: > > > http://localhost:8983/solr/images/select?fl=content=id:
: > > 1=1=csv&
: > > > csv.separator==null
: > > > ?
: > > >
: > > > On Wed, Apr 18, 2018 at 1:18 PM, Lee Carroll <
: > > lee.a.carr...@googlemail.com
: > > > >
: > > > wrote:
: > > >
: > > > > sorry cut n paste error i'd get
: > > > >
: > > > > {
: > > > >   "responseHeader":{
: > > > > "zkConnected":true,
: > > > > "status":0,
: > > > > "QTime":0,
: > > > > "params":{
: > > > >   "q":"*:*",
: > > > >   "fl":"content",
: > > > >   "rows":"1"}},
: > > > >   "response":{"numFound":1,"start":0,"docs":[
: > > > >   {
: > > > > "content":"my-content-value"}]
: > > > >   }}
: > > > >
: > > > >
: > > > > but you get my point
: > > > >
: > > > >
: > > > >
: > > > > On 18 April 2018 at 11:13, Lee Carroll  >
: > > > > wrote:
: > > > >
: > > > > > for http://localhost:8983/solr/images/select?fl=content=id:
: > > 1=1
: > > > > >
: > > > > > I'd get
: > > > > >
: > > > > > {
: > > > > >   "responseHeader":{
: > > > > > "zkConnected":true,
: > > > > > "status":0,
: > > > > > "QTime":1,
: > > > > > "params":{
: > > > > >   "q":"*:*",
: > > > > >   "_":"1524046333220"}},
: > > > > >   "response":{"numFound":1,"start":0,"docs":[
: > > > > >   {
: > > > > > "id":"1",
: > > > > > "content":"my-content-value",
: > > > > > "*content-type*":"text/plain"}]
: > > > > >   }}
: > > > > >
: > > > > > when i want
: > > > > >
: > > > > > my-content-value
: > > > > >
: > > > > >
: > > > > >
: > > > > > On 18 April 2018 at 10:55, Mikhail Khludnev 
: > wrote:
: > > > > >
: > > > > >> Lee, from this description I don see why it can't be addressed by
: > > > > fl,rows
: > > > > >> params. What makes it different form the typical Solr usage?
: > > > > >>
: > > > > >>
: > > > > >> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
: > > > > >> lee.a.carr...@googlemail.com>
: > > > > >> wrote:
: > > > > >>
: > > > > >> > Sure, we want to return a single field's value for the top
: > > matching
: > > > > >> > document for a given query. Bare content rather than a full
: > search
: > > > > >> result
: > > > > >> > listing.
: > > > > >> >
: > > > > >> > To be concrete:
: > > > > >> >
: > > > > >> > For a schema of fields id [unique key],
: > > > content[stored],content-type[
: > > > > >> > 

Re: Migrating from Solr 6.6 getStatistics() to Solr 7.x

2018-04-06 Thread Chris Hostetter

: In my Solr 6.6 based code, I have the following line that get the total
: number of documents in a collection:
: 
: totalDocs=indexSearcher.getStatistics().get("numDocs"))
... 
: With Solr 7.2.1, 'getStatistics' is no longer available, and it seems that
: it is replaced by 'collectionStatistics' or 'termStatistics':
...
: So my questions is what is the equivalent statement in solr 7.2.1? Is it:
: 
: solrIndexSearcher.collectionStatistics("numDocs").maxDoc();

Uh... no.  that's not quite true.

In the 6.x code line, getStatistics() was part of the SolrInfoMBean API 
that SolrIndexSearcher and many other Solr objects implemented...

http://lucene.apache.org/solr/6_6_0/solr-core/org/apache/solr/search/SolrIndexSearcher.html#getStatistics--

In 7.0, SolrInfoMBean was replaced with SolrInfoBean as part of the switch 
over to the new, more robust Metrics API...

https://lucene.apache.org/solr/guide/7_0/major-changes-in-solr-7.html#jmx-support-and-mbeans
https://lucene.apache.org/solr/guide/7_0/metrics-reporting.html
http://lucene.apache.org/solr/7_0_0/solr-core/org/apache/solr/core/SolrInfoBean.html

(The collectionStatistics() and termStatistics() methods are lower level 
Lucene concepts)

IIRC The closest 7.x equivilent to "indexSearcher.getStatistics()" is 
"indexSearcher.getMetricsSnapshot()" ... but the keys in that map will 
have slightly diff/longer names then they did before, you can use 
"indexSearcher.getMetricNames()" so see the full list.

...but frankly that's all a very comlicated way to get "numDocs" 
if you're writting a solr plugin that has direct access to a 
SolrIndexSearcher instance ... you can just call 
"solrIndexSearcher.numDocs()" method and make your life a lot simpler.



-Hoss
http://www.lucidworks.com/


Re: Copy field on dynamic fields?

2018-04-05 Thread Chris Hostetter

: Have you tried reading existing example schemas? They show various
: permutations of copy fields.

Hmm... as the example schemas have been simplified/consolidated/purged it 
seems we have lost the specific examples that are relevant to the user's 
question -- the only instance of a glob'ed copyField in any of the 
configsets we ship is with a single destination field.

And the ref guide doesn't mention globs in copyField dest either? 
(created SOLR-12191)

Jatin: what you are asking about is 100% possible -- here's some examples 
from one of our test configs used specifically for testing copyField...

  
  
  

This ensures that any field name starting with "dynamic_" is also copied 
to an "equivilent" field name *ending* with "_dynamic"

so "1234_dynamic" gets copied to "dynamic_1234", "foo_dynamic" gets copied 
to "dynamic_foo" etc...

This "glob" pattern in copyFields also works even if the underlying fields 
are not dynamicField...

  
  
  
  

 so "sku1" and "sku2" will be each copied to "1_s" and "2_s" respectively 
... you could also mix & match that with a  if you wanted sku1 and sku2 to have special types, but some ohther more 
common type for other sku* fields.
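
(The actual declarations were stripped by the mail archive when this 
message was rendered, but hedged reconstructions of glob copyFields like 
the ones being described would look something like...

  <copyField source="*_dynamic" dest="dynamic_*"/>
  <copyField source="sku*" dest="*_s"/>

...where the portion of the source field name that matches the "*" is 
substituted into the "*" of the dest.)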






: Regards,
: Alex
: 
: On Thu, Apr 5, 2018, 2:54 AM jatin roy,  wrote:
: 
: > Any update?
: > 
: > From: jatin roy
: > Sent: Tuesday, April 3, 2018 12:37 PM
: > To: solr-user@lucene.apache.org
: > Subject: Copy field on dynamic fields?
: >
: > Hi,
: >
: > Can we create copy field on dynamic fields? If yes then how it decide
: > which field should be copied to which one?
: >
: > For example: if I have dynamic field: category_* and while indexing 4
: > fields are formed such as:
: > category_1
: > category_2
: > category_3
: > category_4
: > and now I have to copy the contents of already existing dynamic field
: > "category_*" to "new_category_*".
: >
: > So my question is how the algorithm decides that category_1 data has to be
: > indexed in new_category_1 ?
: >
: > Regards
: > Jatin Roy
: > Software developer
: >
: >
: 

-Hoss
http://www.lucidworks.com/


Re: Error in indexing JSON with space in value

2018-03-22 Thread Chris Hostetter
: 
: Ah, there's the extra bit of context:
: > PS C:\curl> .\curl '
: 
: You're using Windows perhaps?  If so, it's probably a shell issue
: getting all of the data to the "curl" command.

Yep.. and you can even see in the trace output that curl thinks the entire 
JSON payload you want to send is 24 bytes long, and ends with '"Joe'...

: > 0090: 6a 73 6f 6e 0d 0a 43 6f 6e 74 65 6e 74 2d 4c 65 json..Content-Le
: > 00a0: 6e 67 74 68 3a 20 32 34 0d 0a 0d 0a ngth: 24
: > => Send data, 24 bytes (0x18)
: > : 20 7b 20 20 20 69 64 3a 31 2c 20 20 20 6e 61 6d  {   id:1,   nam
: > 0010: 65 5f 73 3a 20 4a 6f 65 e_s: Joe
: > == Info: upload completely sent off: 24 out of 24 bytes
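
(An easy way to sidestep the Windows shell quoting entirely is to put the 
JSON payload in a file and let curl read it -- a hedged example, the file 
name is made up...

  .\curl "http://localhost:8983/solr/collection/update/json/docs" -H "Content-type:application/json" --data-binary "@doc.json"

...that way the quotes and whitespace inside the JSON never pass through 
the shell at all.)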


-Hoss
http://www.lucidworks.com/


Re: Error in indexing JSON with space in value

2018-03-22 Thread Chris Hostetter


I can't reproduce the problem you described -- using 7.2.1 and the 
techproducts example i can index a JSON string w/white space just fine...

$ bin/solr -e techproducts
$ curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 
'Content-type:application/json' -d '
{
  "id":"1",
  "name_s": "Joe Smith",
  "phone_s": 876876687}'
{
  "responseHeader":{
"status":0,
"QTime":5}}

Your specific curl command had some oddities in it (".\" before "curl";
"-H" on a new line w/o "\" escaping the newline) that may have just been
artifacts of copy/paste into email...

But even when trying to run the command as i think you intended it, i did 
not get a JSON parsing error regarding space -- just a nested doc error 
because your "split" param wasn't compatible with the configured default 
"srcField" option in the techproducts configset...

$ curl 
'http://localhost:8983/solr/techproducts/update/json/docs?srcField=&split=/|/orgs'
 -H 'Content-type:application/json' -d '
{
  "id":"1",
  "name_s": "Joe Smith",
  "phone_s": 876876687,
  "orgs": [
{
  "name1_s" : "Microsoft",
  "city_s" : "Seattle",
  "zip_s" : 98052},
{
  "name1_s" : "Apple",
  "city_s" : "Cupertino",
  "zip_s" : 95014}
  ]
}'
{
  "responseHeader":{
"status":400,
"QTime":2},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"Raw data can be stored only if split=/",
"code":400}}




Are you *certain* it was a plain old space character, and that you didn't 
somehow get an EOF character or NUL byte in there?

Can you try running your curl command with the '--trace -' option and send 
us the full output?





: Date: Thu, 22 Mar 2018 23:48:21 +0800
: From: Zheng Lin Edwin Yeo 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Error in indexing JSON with space in value
: 
: Hi,
: 
: I am trying to index the following JSON, in which there is a space in the
: name "Joe Smith".
: 
: .\curl 'http://localhost:8983/solr/collection/update/json/docs?split=/|/orgs
: '
: -H 'Content-type:application/json' -d '
: {
:   "id":"1",
:   "name_s": "Joe Smith",
:   "phone_s": 876876687,
:   "orgs": [
: {
:   "name1_s" : "Microsoft",
:   "city_s" : "Seattle",
:   "zip_s" : 98052},
: {
:   "name1_s" : "Apple",
:   "city_s" : "Cupertino",
:   "zip_s" : 95014}
:   ]
: }'
: 
: However, I get the following error during the indexing.
: 
: {
:   "responseHeader":{
: "status":400,
: "QTime":1},
:   "error":{
: "metadata":[
:   "error-class","org.apache.solr.common.SolrException",
:   "root-error-class","org.apache.solr.common.SolrException"],
: "msg":"Cannot parse provided JSON: Expected ',' or '}':
: char=(EOF),position=24 AFTER=''",
: "code":400}}
: curl: (3) [globbing] bad range specification in column 39
: 
: 
: If I remove the space in "Joe Smith" to make it "JoeSmith", then the
: indexing is successful. What can we do if we want to keep the space in the
: name? Do we need to include some escape characters or something?
: 
: I'm using Solr 7.2.1.
: 
: Regards,
: Edwin
: 

-Hoss
http://www.lucidworks.com/


Re: Why are cursor mark queries recommended over regular start, rows combination?

2018-03-13 Thread Chris Hostetter

: > 3) Lastly, it is not clear the role of export handler. It seems that the
: > export handler would also have to do exactly the same kind of thing as
: > start=0 and rows=1000,000. And that again means bad performance.

: <3> First, streaming requests can only return docValues="true"
: fields.Second, most streaming operations require sorting on something
: besides score. Within those constraints, streaming will be _much_
: faster and more efficient than cursorMark. Without tuning I saw 200K
: rows/second returned for streaming, the bottleneck will be the speed
: that the client can read from the network. First of all you only
: execute one query rather than one query per N rows. Second, in the
: cursorMark case, to return a document you and assuming that any field
: you return is docValues=false

Just to clarify, there is a big difference between the /export handler 
and "streaming expressions"

Unless something has changed drastically in the past few releases, the 
/export handler does *NOT* support exporting a full *collection* in solr 
cloud -- it only operates on an individual core (aka: shard/replica).  
Streaming expressions is a feature that does work in Cloud mode, and can 
make calls to the /export handler on a replica of each shard in order to 
process the data of an entire collection -- but when doing so it has to 
aggregate *ALL* the results from every shard in memory on the 
coordinating node -- meaning that (in addition to the docvalues caveat) 
streaming expressions requires you to "spend" a lot of ram usage on one 
node as a trade off for spending more time & multiple requests to get the 
same data from cursorMark...

https://lucene.apache.org/solr/guide/exporting-result-sets.html
https://lucene.apache.org/solr/guide/streaming-expressions.html

An additional perk of cursorMark that may be relevant to the OP is that 
you can "stop" tailing a cursor at any time (ie: if you're post processing 
the results client side and decide you have "enough" results) but a similar 
feature isn't available (AFAICT) from streaming expressions...

https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor


-Hoss
http://www.lucidworks.com/


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Chris Hostetter

: We are using Solr 7.1.0 to index a database of addresses.  We have found 
: that our index size increases massively when we add one extra field to 
: the index, even though that field is stored and not indexed, and doesn’t 

what about docValues?

: When we run an index load without the problematic field present, the 
: Solr index size is 5.5GB.  When we add the field into the index, the 
: size grows to 13.3GB.  The field itself is a maximum of 46 characters in 
: length and on average is 19 characters. We have ~14,000,000 rows in 
: total to index of which only ~200,000 have this field present at all 
: (i.e. not null in database).  Given that we don’t want to index the 
: field, only store it I would have thought (perhaps naively) that the 
: storage increase would be approximately 200,000 * 19 = 3.8M bytes = 
: 3.6MB rather than the 7.5GB we are seeing.

if the field has docValues enabled, then there will be some overhead for 
every doc in the index -- even the ones that don't have a value in this 
field.  (although i'd still be very surprised if it accounted for 7G)

: - The problematic field is created through the API as follows:
: 
:   curl -X POST -H 'Content-type:application/json' --data-binary '{
: "add-field":{
:   "name":"buildingName",
:   "type":"string",
:   "stored":true,
:   "indexed":false
: }
:   }' http://localhost:8983/solr/address/schema

...that's going to cause the field to inherit any (non-overridden) 
settings from the fieldType "string" -- in the 7.1 _default configset, 
"string" is defined with docValues="true"

You can see *all* properties set on a field -- regardless of whether they 
are set on the fieldType, or are implicit hardcoded defaults in the 
implementation of the fieldType -- via the 'showDefaults=true' Schema API 
option.

Consider these API examples from the techproducts demo...

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat'
{
  "responseHeader":{
"status":0,
"QTime":0},
  "field":{
"name":"cat",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true}}

$ curl 
'http://localhost:8983/solr/techproducts/schema/fields/cat?showDefaults=true'
{
  "responseHeader":{
"status":0,
"QTime":0},
  "field":{
"name":"cat",
"type":"string",
"indexed":true,
"stored":true,
"docValues":false,
"termVectors":false,
"termPositions":false,
"termOffsets":false,
"termPayloads":false,
"omitNorms":true,
"omitTermFreqAndPositions":true,
"omitPositions":false,
"storeOffsetsWithPositions":false,
"multiValued":true,
"large":false,
"sortMissingLast":true,
"required":false,
"tokenized":false,
"useDocValuesAsStored":true}}







-Hoss
http://www.lucidworks.com/

Re: "editorialMarkerFieldName"

2018-02-12 Thread Chris Hostetter

https://issues.apache.org/jira/browse/SOLR-11977

: Date: Mon, 12 Feb 2018 14:44:34 -0700 (MST)
: From: Chris Hostetter <hossman_luc...@fucit.org>
: To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
: Subject: Re: "editorialMarkerFieldName"
: 
: 
: IIUC the "editorialMarkerFieldName" config option is a bit missleading.
: 
: Configuring that doesn't automatically add a field w/that name to your 
: docs to indicate which of them have been elevated -- all it does is 
: provide an *override* for what name can be used to refer to the 
: "[elevated]" DocTransformer.
: 
: So by default you can do something like this...
: 
:   q=ipod=text=id,[elevated]
: 
: ...but if you have foo in 
: your searchComponent config, then instead of "[elevated]" you would have 
: to say...
: 
: q=ipod=text=id,[foo]
: 
: ...to get the same info.
: 
: It's a very weird and silly feature -- i honestly can't give you 
: any good explaination as towhy it was implemented that way.
: 
: 
: 
: : Date: Mon, 5 Feb 2018 04:12:27 +
: : From: Sadiki Latty <sla...@uottawa.ca>
: : Reply-To: solr-user@lucene.apache.org
: : To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
: : Subject: "editorialMarkerFieldName"
: : 
: : Hello
: : 
: : I have added the "editorialMarkerFieldName" to my search component but 
nothing happens. Am I missing something in my configuration? I have confirmed 
that the elevation aspect is working as it should. The documents in the 
'elevate.xml' are being elevated so the component is being read, but 
specifically that parameter does not seem to change the result. I have a 
configuration similar to the one below (copied from the guide) and I have the 
elevator str in my 'last-components' section of my request handler.
: : 
: : 
: :   
: :   string
: :   elevate.xml
: :  foo
: : 
: : 
: : 
: : Am I misunderstanding the purpose of this parameter? Isnt it supposed to 
distinguish the elevated results from the normal results with the given string?
: : 
: : I am using Solr 7.1.0 btw
: : 
: : Thanks in advance,
: : 
: : Sid
: : 
: 
: -Hoss
: http://www.lucidworks.com/
: 

-Hoss
http://www.lucidworks.com/


Re: "editorialMarkerFieldName"

2018-02-12 Thread Chris Hostetter

IIUC the "editorialMarkerFieldName" config option is a bit missleading.

Configuring that doesn't automatically add a field w/that name to your 
docs to indicate which of them have been elevated -- all it does is 
provide an *override* for what name can be used to refer to the 
"[elevated]" DocTransformer.

So by default you can do something like this...

q=ipod&df=text&fl=id,[elevated]

...but if you have <str name="editorialMarkerFieldName">foo</str> in 
your searchComponent config, then instead of "[elevated]" you would have 
to say...

q=ipod&df=text&fl=id,[foo]

...to get the same info.

It's a very weird and silly feature -- i honestly can't give you 
any good explanation as to why it was implemented that way.



: Date: Mon, 5 Feb 2018 04:12:27 +
: From: Sadiki Latty 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: "editorialMarkerFieldName"
: 
: Hello
: 
: I have added the "editorialMarkerFieldName" to my search component but 
nothing happens. Am I missing something in my configuration? I have confirmed 
that the elevation aspect is working as it should. The documents in the 
'elevate.xml' are being elevated so the component is being read, but 
specifically that parameter does not seem to change the result. I have a 
configuration similar to the one below (copied from the guide) and I have the 
elevator str in my 'last-components' section of my request handler.
: 
: 
:   
:   string
:   elevate.xml
:  foo
: 
: 
: 
: Am I misunderstanding the purpose of this parameter? Isnt it supposed to 
distinguish the elevated results from the normal results with the given string?
: 
: I am using Solr 7.1.0 btw
: 
: Thanks in advance,
: 
: Sid
: 

-Hoss
http://www.lucidworks.com/


RE: DovValues and in-place udpates

2018-02-12 Thread Chris Hostetter

: True, I could remove the trigger to rebuild the entire document. But 
: what if a different field changes and the whole document is triggered 
: for update for a different field. We have the same problem.

at a high level, your concern is really completely orthogonal to the
question of in-place updates, it's a broader question of having 2 diff
systems that might want to modify the same document in solr, but one
system is "slower" than the other (because it has to fetch more external  
data or only operates in batches, etc...)
This is where things like optimistic concurrency are really powerful.

When you trigger your "slow" updates (or any updates for that matter),
keep track of the current (aka "expected") _version_ field of the solr
document when your updater starts processing -- and pass that in along   
with the new update -- solr will reject an update if the specified
_version_ doesn't match what's in the index.

https://lucene.apache.org/solr/guide/updating-parts-of-documents.html#optimistic-concurrency

So imagine the current instock=1 version of your product is 42, and you
start a "slow" update to change the "name" field ... while that's in 
progress a "fast" update sets instock=0 and now you have a new 
_version_=666.  When the "slow" updater is done building up the entire 
document, and sends it to solr along with the _version_=42 assumption, 
solr will reject the update with a "Conflict (409)" HTTP Status, and your 
slow update code can say "ok ... i must have stale data, let's try again"



: 
: -Original Message-
: From: Erick Erickson [mailto:erickerick...@gmail.com] 
: Sent: Monday, February 12, 2018 11:17 AM
: To: solr-user 
: Subject: Re: DovValues and in-place udpates
: 
: "But it also triggers a slow update that will rebuild the entire document..."
: 
: Why do you think this? The whole _point_ of in-place updates is that they 
don't have to re-index the whole document And the only way to do that 
effectively would be if all the fields are stored, which is not a requirement 
for in-place updates.
: 
: Best,
: Erick
: 
: On Mon, Feb 12, 2018 at 8:02 AM, Brian Yee  wrote:
: > I asked a question here about fast inventory updates last week and I was 
recommended to use docValues with partial in-place updates. I think this will 
work well, but there is a problem I can't think of a good solution for.
: >
: > Consider this scenario:
: > InStock = 1 for a product.
: > InStock changes to 0 which triggers a fast in-place update with docValues.
: > But it also triggers a slow update that will rebuild the entire document. 
Let's say that takes 10 minutes because we do updates in batches.
: > During that 5 minutes, InStock changes again to 1 which triggers a fast 
update to solr. So in Solr InStock=1 which is correct.
: > The slow update finishes and overwrites InStock=0 which is incorrect.
: >
: > How can we deal with this situation?
: 

-Hoss
http://www.lucidworks.com/


RE: External file fields

2018-02-02 Thread Chris Hostetter

: Interesting. I will definitely explore this. Just so I'm clear, we can 
: sort on docValues, but not filter? Is there any situation where external 
: file fields would work better than docValues?

For most field types that support docValues, you can still filter on it 
even if it's indexed="false" -- but the filtering may not be as efficient 
as using indexed values.  for numeric fields you certainly can.

One situation where ExternalFileField would probably be preferable to doing 
inplace updates on docValues is when you know you need to update the value 
for *every* document in your collection in batch -- for large 
collections, looping over every doc and sending an atomic update would 
probably be slower than just replacing the external file.

Another example when i would probably choose external file field over 
docValues is if the "keyField" was not the same as my uniqueKey field ... 
ie: if i have millions of documents each with a category_id that has a 
cardinality of ~100 categories.  I could use 
the category_id field as the keyField to associate every doc w/some 
numeric "category_rank" value (that varies only per category).  If i 
need/want to tweak 1 of those 100 category_rank values, updating the 
entire external file just to change that 1 value is still probably much 
easier than redundantly putting that category_rank field in every 
doc and sending an atomic update to all ~10K docs that have the 
same category_id,category_rank i want to change.
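
(For reference, a hedged sketch of that kind of setup -- the names are 
made up...

  <fieldType name="rankFile" class="solr.ExternalFileField" keyField="category_id" defVal="0"/>
  <field name="category_rank" type="rankFile" indexed="false" stored="false"/>

...with a file named external_category_rank in the index data dir 
containing one "key=value" line per category, ie:

  101=1.5
  102=0.25

...so changing one category's rank only means updating that file.)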


: 
: -Original Message-
: From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
: Sent: Friday, February 2, 2018 12:24 PM
: To: solr-user@lucene.apache.org
: Subject: RE: External file fields
: 
: 
: : I did look into updatable docValues, but my understanding is that the
: : field has to be non-indexed (indexed="false"). I need to be able to sort
: : on these values. External field fields are sortable.
: 
: YOu can absolutely sort on a field that is docValues="true" 
: indexed="false" ... that is much more efficient then sorting on a field that 
is docValues="false" index="true" -- in the later case solr has to build a 
fieldcache (aka: run-time-mock-docvalues) from the indexed values the first 
time you try to sort on the field after a searcher is opened
: 
: 
: 
: -Hoss
: http://www.lucidworks.com/
: 

-Hoss
http://www.lucidworks.com/


Re: Query fields with data of certain length

2018-02-02 Thread Chris Hostetter

: Have you manage to get the regex for this string in Chinese: 预支款管理及账务处理办法 ?
...
: > An example of the string in Chinese is 预支款管理及账务处理办法
: >
: > The number of characters is 12, but the expected length should be 36.
...
: >> > So this would likely be different from what the operating system
: >> counts, as
: >> > the operating system may consider each Chinese characters as 3 to 4
: >> bytes.
: >> > Which is probably why I could not find any record with
: >> subject:/.{255,}.*/

Java regexes operate on unicode strings, so ".' matches any *character*
There is no regex syntax to match an any "byte" so a regex based approach 
is never going to be viable.

Your best bet is to check the byte count when indexing -- but even then 
you'd need some custom code since things like 
FieldLengthUpdateProcessorFactory are well behaved and count the 
*characters* of the unicode strings.
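
A hedged sketch of what that custom code could look like (this processor 
does not ship with Solr, the field names are made up, and it would still 
need a small UpdateRequestProcessorFactory to wire it into your chain)...

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  // stores the UTF-8 *byte* length of "subject" in "subject_bytes" so you can
  // later query subject_bytes:[255 TO *] instead of resorting to regex tricks
  public class SubjectByteLengthProcessor extends UpdateRequestProcessor {
    public SubjectByteLengthProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      Object subject = doc.getFieldValue("subject");
      if (subject instanceof String) {
        doc.setField("subject_bytes",
                     ((String) subject).getBytes(StandardCharsets.UTF_8).length);
      }
      super.processAdd(cmd);
    }
  }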

If you absolutely can't reindex, then you'd need a custom QParser that 
produced a custom Query object that iterated over the TermEnum looking at 
the buffers and counting the bytes in each term -- matching each doc 
associated with those terms.



-Hoss
http://www.lucidworks.com/

RE: External file fields

2018-02-02 Thread Chris Hostetter

: I did look into updatable docValues, but my understanding is that the 
: field has to be non-indexed (indexed="false"). I need to be able to sort 
: on these values. External field fields are sortable.

YOu can absolutely sort on a field that is docValues="true" 
indexed="false" ... that is much more efficient then sorting on a field 
that is docValues="false" index="true" -- in the later case solr has to 
build a fieldcache (aka: run-time-mock-docvalues) from the indexed values 
the first time you try to sort on the field after a searcher is opened
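
ie: a field declared something like this (a hedged example -- the name is 
made up, and "pfloat" assumes a 7.x style schema) is perfectly sortable...

  <field name="popularity" type="pfloat" indexed="false" stored="false" docValues="true"/>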



-Hoss
http://www.lucidworks.com/


Re: SOLR 7.1 queries not including empty fields in results

2018-01-24 Thread Chris Hostetter

: I am converting a SOLR 4.10 db to SOLR 7.1
: 
: It is NOT schemaless - so it uses a ClassicIndexSchemaFactory.
: 
: In 4.10, I have a field that is a phone number (here's the schema information 
for the field):
: 
: 
: 
: When inserting documents into SOLR, there are some documents where the 
: value of Phone is an empty string or a single blank space.
... 
: But when these same rows are inserted into SOLR 7.1, the documents 
: returned for those rows have no Phone field

Are you still using the same solrconfig.xml you had in 4.10, or did you 
switch to using a newer sample/default set (or otherwise modify your 
solrconfig.xml)?

I ask because even if you are using the ClassicIndexSchemaFactory, your 
update processor chain might be using TrimFieldUpdateProcessorFactory 
and/or RemoveBlankFieldUpdateProcessorFactory ?

When i use the sample techproducts configs in 7.1, I have no problem 
adding either an empty string or a blank space to a string field...



$ bin/solr -e techproducts
...
$ curl -H 'Content-Type: application/json' 
'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary 
'[{"id":"white","foo_s":" "},{"id":"blank","foo_s":""}]'
{
  "responseHeader":{
"status":0,
"QTime":40}}
$ curl 'http://localhost:8983/solr/techproducts/query?q=foo_s:*'
{
  "responseHeader":{
"status":0,
"QTime":12,
"params":{
  "q":"foo_s:*"}},
  "response":{"numFound":2,"start":0,"docs":[
  {
"id":"white",
"foo_s":" ",
"_version_":1590517543569719296},
  {
"id":"blank",
"foo_s":"",
"_version_":1590517543570767872}]
  }}




-Hoss
http://www.lucidworks.com/


Re: Issues with refine parameter when subfaceting in a range facet

2018-01-24 Thread Chris Hostetter

: We encountered an issue when using the refine parameter when subfaceting in
: a range facet.
: When enabling the refine option, the counts of the response are the double
: of the counts of the response without refine option.
: We are running Solr 6.6.1 in a cloud setup.
...
: If I execute the same query WITHOUT refine: true in the subfacet, I get the
: following response:
...
: There is a factor 2 difference for each count in each bucket.
: 
: If I perform the same queries with a larger range gap, e.g.
:   \"start\":0.0,
:   \"end\":55000.0,
:   \"gap\":5000.0,
: there is no difference between the response with and without refine: true.

FWIW: Based on the info you provided it's impossible for us to tell which 
of those 2 responses is "correct" ... ie: you know the source data, and 
have access to the index, and can do a test query where you filter (fq) on 
those ranges + field values to confirm what the "correct" numFound is -- 
we can't.  So we can't be certain if you're getting the 
"correct" counts with or with out the refine param -- which is kind of 
important to help troubleshoot.
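
(e.g. something like the following, with your real field names and the 
range/term of one suspect bucket substituted in, and then compare numFound 
against what that bucket reports with and without refine...

  q=*:*&rows=0&fq=price:[0 TO 5000}&fq=category:foo
)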

That said...

: Is this a known issue, or is there something we are overlooking?
: And is there information on whether or not this behavior will be the same
: in Solr 7?

...off the top of my head, I don't know of any specific bugs in 6.x with 
type:range facet counts being doubled/halved when a refined/non-refined 
subfacet is used -- but it's certainly possible that one existed and was 
explicitly or inadvertently fixed.  

I can tell you that very recently I worked on 2 jiras involving type:range 
faceting -- one of which was a bug that caused incorrect results when 
"mincount" was used in cloud mode -- and as a result added some more 
robust testing (to master and branch_7x) of type:range facets as both 
parent facets and sub-facets of type:term facets -- and i did not 
encounter any sort of mismatches like the one you are describing...

https://issues.apache.org/jira/browse/SOLR-11824
https://issues.apache.org/jira/browse/SOLR-3218

...if you can reproduce this with branch_7x (7.3) then please file a jira 
with some sample data/config/queries that can be used to reproduce.


-Hoss
http://www.lucidworks.com/


Re: BinaryResponseWriter fetches unnecessary fields?

2018-01-24 Thread Chris Hostetter

: Thanks Chris! Is RetrieveFieldsOptimizer a new functionality introduced in
: 7.x?  Our observation is with both 5.4 & 6.4.  I have created a jira for
: the issue:

The same basic code path (related to stored fields) probably existed 
largely as is in 5.x and 6.x and was then later refactored into  
RetrieveFieldsOptimizer where it knows about things like the 
useDocValuesAsStored option/optimization.

-Hoss
http://www.lucidworks.com/


Re: BinaryResponseWriter fetches unnecessary fields?

2018-01-22 Thread Chris Hostetter

: Inside convertLuceneDocToSolrDoc():
: 
: 
: https://github.com/apache/lucene-solr/blob/df874432b9a17b547acb24a01d3491
: 839e6a6b69/solr/core/src/java/org/apache/solr/response/
: DocsStreamer.java#L182
: 
: 
:for (IndexableField f : doc.getFields())
: 
: 
: I am a bit puzzled why we need to iterate through all the fields in the
: document. Why can’t we just iterate through the requested fields in fl?
: Specifically:

I have a hunch here -- but i haven't verified it.

First of all: the specific code in question that you mention assumes it 
doesn't *need* to filter out the result of "doc.getFields()" based on the 
'fl' because at the point in the processing where the DocsStreamer is 
looping over the result of "doc.getFields()" the "Document" object it's 
dealing with *should* only contain the specific (subset of stored) fields 
requested by the fl param -- this is handled by RetrieveFieldsOptimizer & 
SolrDocumentFetcher that the DocsStreamer builds up according to the 
results of ResultContext.getReturnFields() when asking the 
SolrIndexSearcher to fetch the doc()

But I think what's happening here is that because of the documentCache, 
there are cases where the SolrIndexSearcher is not actually using 
a SolrDocumentStoredFieldVisitor to limit what's requested from the 
IndexReader, and the resulting Document contains all fields -- which is 
then compounded by code that loops over every field.  

At a quick glance, I'm a little fuzzy on how exactly 
enableLazyFieldLoading may/may-not be affecting things here, but either 
way I think you are correct -- we can/should make this overall stack of 
code smarter about looping over fields we know we want, vs looping over 
all fields in the doc.
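
(If you want to experiment with whether the documentCache is the trigger, 
the relevant knobs live in the <query> section of solrconfig.xml -- the 
sizes below are only illustrative, not a recommendation...)

  <query>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
  </query>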

Can you please file a jira for this?


-Hoss
http://www.lucidworks.com/

Re: PayloadScoreQuery always returns score of zero

2018-01-15 Thread Chris Hostetter

what does your full request, including the results block look like when 
you search on one of these queries with "fl=*,score" ?

I'm suspicious that perhaps the problem isn't the payload encoding, or the 
PayloadScoreQuery -- but perhaps it's simply a bug in the Explanation 
produced by those queries?
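
(Something along these lines -- the collection name here is a placeholder 
-- would show the actual scores next to the explain output:)

  http://localhost:8983/solr/yourCollection/query?q={!payload_score f=report v=apple func=sum}&fl=*,score&debug=results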

: Date: Wed, 13 Dec 2017 14:15:48 -0600
: From: John Anonymous 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: PayloadScoreQuery always returns score of zero
: 
: The PayloadScoreQuery always returns a score of zero, regardless of
: payloads.  The PayloadCheckQParser works fine, so I know that I am
: successfully indexing the payloads.   Details below
: 
: *payload field that I am searching on:*
: [field definition stripped by the list archive]
: 
: *definition of payload field type:*
: [fieldType definition stripped by the list archive]
: 
: *Adding some documents with payloads in my test:*
: assertU(adoc(
: "key", "1",
: "report", "apple¯0 apple¯0 apple¯0"
: ));
: assertU(adoc(
: "key", "2",
: "report", "apple¯1 apple¯1 text¯1"
: ));
: 
: 
: *query:*{!payload_score f=report v=apple func=sum}
: 
: *score (both documents have a score of zero):*
: 
: 
: 0.0 = SumPayloadFunction.docScore()
: 
: 
: 0.0 = SumPayloadFunction.docScore()
: 
:   
: 
: I have tried using func=max as well, but it makes no difference.  Can
: anyone help me with what I am missing here?
: Thanks!
: Johnathan
: 

-Hoss
http://www.lucidworks.com/

Re: Heavy operations in PostFilter are heavy

2018-01-12 Thread Chris Hostetter

: Yes I do so. The problem is that the collect-method is called for EVERY 
: document the query matches. Even if the user only wants to see like 10 
: documents. The operation I have to perform takes maybe 50ms/per document 

You're running into a classic chicken/egg problem with document collection 
& filtering -- you don't want your expensive filter to be run against every 
doc that matches the query (and lower cost filters), just the "top 10" the 
user is going to see -- but Solr doesn't know what those top 10 are yet, 
not until it's collected & sorted all of them ... and your PostFilter 
can change what gets collected ... it's a filter!

Also: Things like faceting (and even just returning an accurate numFound!) 
require that all matches be "collected" ... unless you are using sorted 
segments and early termination, your PostFilter has to be consulted about 
every (potential) match in order for the results to be accurate.

: if I have to process them singly. And maybe 30ms if I could get a 
: document list. But if the user e.g. uses a wildcard query that matches 

If processing in batch is a viable option, then one approach you may want 
to consider is the approach used by the CollapseQParser and the 
PostFilter it generates -- it doesn't pass on any collected documents to 
its delegate as it collects them -- it essentially just batches them all 
up, and then in the "finish" method it processes them and calls 
delegate.collect() on the ones it decides are important.

-Hoss
http://www.lucidworks.com/


Re: trivia question: why q=*:* doesn't return same result as q.alt=*:*

2018-01-12 Thread Chris Hostetter

: defType=dismax does NOT do anything special with *:* other than treat it 
...
: > As Chris explained, this is special:
...

I'm interpreting your followup question differently than Erick & Erik 
did.  I'm going to assume both E & E misunderstood your question, and I'm 
going to assume you completely understood my response to your original 
question.

I'm going to assume that a way to reword/expand your followup question is 
something like this...

"I understand now that defType=dismax doesn't support special syntax like 
'*:*' and treats that 3 input as just another 3 character string to search 
against the qf & pf fields -- but now what i don't understand is why are 
list of fields in the debug query output is different for 'q=*:*' compared 
to something like 'q=hello'"

(If I have not understood your followup question correctly, please 
clarify)

Let's look at those outputs you mentioned...

: >> http://localhost:8983/solr/filesearch/select?fq=id:1193;
: >> q=*:*=true
: >> 
: >> 
: >>   - parsedquery: "+DisjunctionMaxQuery((user_email:*:* | user_name:*:* |
: >>   tags:*:* | (name_shingle_zh-cn:, , name_shingle_zh-cn:, ,) |
: >> id:*:*)~0.01)
: >>   DisjunctionMaxQuery(((name_shingle_zh-cn:", , , ,"~100)^100.0 |
: >>   tags:*:*)~0.01)",
...
: >> e.g. following query uses the my expected set of pf and qf.
...
: >> http://localhost:8983/solr/filesearch/select?fq=id:1193;
: >> q=hello=true
: >> 
: >> 
: >> 
: >>   - parsedquery: "+DisjunctionMaxQuery(((name_token:hello)^60.0 |
: >>   user_email:hello | (name_combined:hello)^10.0 | (name_zh-cn:hello)^10.0
: >> |
: >>   name_shingle:hello | comments:hello | user_name:hello |
: >> description:hello |
: >>   file_content_zh-cn:hello | file_content_de:hello | tags:hello |
: >>   file_content_it:hell | file_content_fr:hello | file_content_es:hell |
: >>   file_content_en:hello | id:hello)~0.01)
: >> DisjunctionMaxQuery((description:hello
: >>   | (name_shingle:hello)^100.0 | comments:hello | tags:hello)~0.01)",


The answer has to do with the list of qf & pf fields you have configured 
-- you didn't provide us with concrete specifics of what qf/pf you 
have configured in your requestHandler -- but you did mention in your 
second example that "following query uses the my expected set of pf and 
qf"

By comparing the 2 examples at a glance, it appears that the fields in the 
first example (q=*:* ... again, searching for the literal 3 character 
string '*:*') are (mostly) a subset of the fields you "expected" (from the 
2nd example)

I'm fairly certain that what's happening here is that in both examples the 
literal string input is being given to the analyzer for all of your fields 
-- but in the case of the (literal) string '*:*' many of the analyzers are 
producing no terms at all -- ie: they are completely stripping out 
punctuation -- so they don't appear in the final query.
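
(You can confirm this per field with the analysis handler -- the field name 
below is just one of yours picked as an example:)

  http://localhost:8983/solr/filesearch/analysis/field?analysis.fieldname=name_shingle_zh-cn&analysis.fieldvalue=*:*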

IIUC it looks like one other oddity here is that the reverse also 
seems to be true in some cases -- I suspect that 
although "name_shingle_zh-cn" doesn't appear in your 2nd example, it 
probably *is* in your pf param but whatever analyzer you have configured 
for it produces no tokens for the latin characters "hello" but does 
produce tokens for the pure-punctuation characters "*:*"


(If I'm correct about your question, but wrong about your qf/pf, then 
please provide us with a lot more details -- notably your full 
schema/solrconfig used when executing those queries.)


-Hoss
http://www.lucidworks.com/


Re: SolrException undefined field *

2018-01-09 Thread Chris Hostetter

: Might be the case as you mentioned Shawn. But there are no search requests
: triggered and might be that somehow a search query is getting fired to Solr
: end while indexing. Given the complete log information(solr.log) while the
: indexing is triggered.

the search request is triggered by a "newSearcher" cache warming query -- 
solr explicitly adds an "event=newSearcher" param so it shows up right 
there in the log for the query that's failing...

: 
(searcherExecutor-46-thread-1-processing-x:master_backoffice_backoffice_product_default)
: [   x:master_backoffice_backoffice_product_default] o.a.s.c.S.Request
: [master_backoffice_backoffice_product_default]  webapp=null path=null
: 
params={q=*:*%26facet%3Dtrue%26facet.field%3DcatalogVersion%26facet.field%3DcatalogId%26facet.field%3DapprovalStatus_string%26facet.field%3Dcategory_string_mv=false=newSearcher}
: status=400 QTime=0

You probably have something like this in your solrconfig.xml...

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*&amp;facet=true&amp;facet.field=catalogVersion&amp;...</str>
      </lst>
    </arr>
  </listener>
When what you should have is something like...

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        ...
      </lst>
    </arr>
  </listener>
https://lucene.apache.org/solr/guide/7_2/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-Query-RelatedListeners



-Hoss
http://www.lucidworks.com/


Re: trivia question: why q=*:* doesn't return same result as q.alt=*:*

2018-01-06 Thread Chris Hostetter

: Yes, i am using dismax. But dismax allows *:* for q.alt ,which also seems
: like inconsistency.

dismax is a *parser* that affects how a single query string is parsed.

when you use defType=dismax, that only changes how the "q" param is 
parsed -- not any other query string params, like "fq" or "facet.query" 
(or "q.alt")

when you have a request like "defType=dismax&q=&q.alt=*:*" what you are 
saying, and what solr is doing, is...

* YOU: hey solr, use dismax as the default parser for the q param
* SEARCHHANDLER: ok, if the "q" param does not use local params to 
override the parser, I will use dismax
* SEARCHHANDLER: hey dismax qparser, go parse the string ""
* DISMAXQP: that string is empty, so instead we should use q.alt
* SEARCHHANDLER: ok, I will parse the q.alt param and use that query in 
place of the empty q param
* SEARCHHANDLER: hey lucene qparser, the string "*:*" does not use local 
params to override the parser, please parse it
* LUCENEQP: the string "*:*" is a MatchAllDocsQuery
* SEARCHHANDLER: cool, I'll use that as my main query
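
You can see this for yourself with debug=query (collection name and qf 
below are placeholders) -- the parsedquery in the response will be a 
MatchAllDocsQuery:

  http://localhost:8983/solr/yourCollection/select?defType=dismax&q=&q.alt=*:*&qf=name&debug=query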



-Hoss
http://www.lucidworks.com/


RE: Pass field value through function for filtering

2018-01-05 Thread Chris Hostetter

https://lucene.apache.org/solr/guide/7_2/other-parsers.html

fq={!frange l=0}your(complex(func(fieldA,fieldB),fieldC))

As of 7.2, frange filters will default to being PostFilters as long as 
you use cache=false ...

https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
https://issues.apache.org/jira/browse/SOLR-11641
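
For example, reusing the placeholder function and fields from above, the 
non-cached (and therefore post-filtered, per SOLR-11641) version would 
look like...

  fq={!frange l=0 cache=false}your(complex(func(fieldA,fieldB),fieldC))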


: Date: Tue, 12 Dec 2017 12:15:55 +
: From: Markus Jelsma 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: RE: Pass field value through function for filtering
: 
: Forget about it, i just remember PostFilters!
: 
: Thanks!
: Markus
:  
: -Original message-
: > From:Markus Jelsma 
: > Sent: Tuesday 12th December 2017 12:54
: > To: Solr-user 
: > Subject: Pass field value through function for filtering
: > 
: > Hello,
: > 
: > I have a function and a lot of documents, i want to select all documents 
that give a certain value when i pass a document's field through the function, 
i just want to filter by function, how?
: > 
: > I am thinking of implementing Collector. Get the docId, make a field 
look-up and discard if it doesn't pass my function. But i remember from quite 
some time ago, that doing field look-ups there would be expensive.
: > 
: > Although i don't mind it taking some time (it is for batched work behind 
our scenes), do you have suggestions?
: > 
: > Also, the last time i had to use a custom collector, i had to hack it into 
SolrIndexSearcher to use it. Is there any proper way to use a custom collector 
these days?
: > 
: > Many thanks,
: > Markus
: > 
: 

-Hoss
http://www.lucidworks.com/


Re: solr.TrieDoubleField deprecated with 7.1.0 but wildcard "*" search behaviour is different with solr.DoublePointField

2017-12-11 Thread Chris Hostetter

AFAICT the behavior you're describing with Trie fields was never 
intentionally supported/documented? 

It appears that it only worked as a fluke side effect of how the default 
implementation of FieldType.getPrefixQuery() was inherited by Trie fields 
*and* because "indexed=true" TrieFields use Terms (just like StrField) ... 
so a prefix of "" (the empty string) matched all of the Trie terms in a 
field.

(note that the syntax you're describing does *not* work for Trie fields 
that are "indexed=false docValues=true")

In general, there seems to be a bit of a mess in terms of trying to 
specify "prefix queries" (which is what "foo_d:*" really is under the 
covers) or "wild card" queries against numeric fields. I created a jira to 
try and come to a consensus about how this should behave moving forward...

https://issues.apache.org/jira/browse/SOLR-11746

...but I would suggest you move away from depending on that syntax and use 
the officially supported/documented range query syntax (foo_d:[* TO *]) 
instead.
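
In other words (using the field name from your example) prefer...

  q=test_d:[* TO *]

...over q=test_d:* when you want "all docs that have some value in test_d" 
-- the range form behaves the same way for both Trie and Point fields.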




: some question about the new DoublePointField which should be used
: instead of the TrieDoubleField in 7.1.
...
: If i am using the deprecated one its possible to get a match for a
: double field like this:
: 
: test_d:*
: 
: even in 7.1.0.
: 
: But with the new DoublePointField, which you should use instead, you
: won't get that match - you have to use e.g. [* TO *].

: Is this an intended change in runtime / query behaviour or some bug or
: is it possible to restore that behaviour with the new field too?




-Hoss
http://www.lucidworks.com/


Re: JSON-B deserialization of Solr-response with highlightning

2017-12-08 Thread Chris Hostetter

: We're started to migrate our integration-framework to move over to 
: JavaEE JSON-B as default json-serialization /deserialization framework 
: and now the highlighning component is giving us some troubles. Here's a 
: constructed example of the JSON response from Solr.

Wait .. what?  that makes no sense to me...

IIUC JSON-B is a *binding* framework -- it is intended as a 
replacement for things like java's built in (binary) serialization, or 
JAXB's XML serialization -- where you have java class definitions that you 
annotate with information on how to serialize them, and how to repopulate 
them when deserializing a byte stream.

Attempting to use JSON-B as an end-all-be-all general purpose "json 
parser" to read arbitrary JSON from systems that do *NOT* use JSON-B 
seems absurd.  

*Nothing* about the JSON-B documentation (that I've seen) suggests that 
anyone involved in JSON-B would even remotely suggest/recommend you 
attempt to use it to parse JSON that was not *generated* by JSON-B -- like 
all binding APIs I've ever seen, it assumes/expects that the person 
designing the (Java) Object API will dictate (via annotations or default 
bindings) the serialization details and the framework will take care of 
the validity of the byte stream representation and the deserialization.

I would *strongly* urge you to rethink attempting to use JSON-B to parse 
Solr's response documents ... the "key" in the highlighting response will 
be the least of your problems once you decide you want to use "facets" -- 
or hell, once someone adds a new query param to your request that gets 
echoed back.


-Hoss
http://www.lucidworks.com/


Re: External file field

2017-11-17 Thread Chris Hostetter

: Do I need to define a field with <field> when I use an external file 
: field? I see the <fieldType> to define it, but the docs don’t say how 
: to define the field.

you define the field (or dynamicField) just like any other field -- the 
fieldType is where you specify things like the 'keyField' & the 'defVal', 
but then the field/dynamicField definition dictates the underlying 
filename that will be used

So if you want 5 diff ExternalFileFields that all use keyField="id", you 
only need one <fieldType> and five <field>s -- but if you need them all 
to have 5 diff 'defVal' then you need five <fieldType>s and five 
<field>s 
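
As a concrete sketch (the field and file names are made up), a single 
fieldType shared by two external file fields would look something like...

  <fieldType name="extFile" class="solr.ExternalFileField" keyField="id" 
             defVal="0" indexed="false" stored="false"/>
  <field name="popularity" type="extFile"/>
  <field name="rank_boost" type="extFile"/>

...with the values themselves read from files named external_popularity 
and external_rank_boost in the index data directory.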



-Hoss
http://www.lucidworks.com/


Re: TimeZone issue

2017-11-17 Thread Chris Hostetter
: 
: As I said before, I do not think that Solr will use timezones for date display
: -- ever.  Solr does support timezones in certain circumstances, but I'm pretty

One possibility that has been discussed in the past is the idea of a "Date 
Formatting DocTransformer" that would always return a String in the 
specified format (regardless of the ResponseWriter and any native 
support for Dates it has, ala javabin or xml).

That would be a fairly straightforward plugin to write if someone was so 
inclined -- the hardest part would be deciding on the syntax, so that you 
could specify the client's preferred format, timezone, locale, etc.  But 
then folks who really want to pass off Solr's csv/json/whatever 
responsewriter format directly to an end consumer could control that.
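
(Purely as a hypothetical illustration of what such a syntax might look 
like -- no such transformer exists today, the name and params here are 
invented...)

  fl=id,created_local:[dateformat f=created format="yyyy-MM-dd HH:mm" tz=Europe/Berlin locale=de]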

(likewise we could imagine a "Number Formatting DocTransformer" that would 
do the same thing for people that really want their Integers to come back 
as "1,234,567" or their floats to be formatted to exactly 4 decimal places, 
etc...)


-Hoss
http://www.lucidworks.com/
