Re: Log slow queries to SQL Database using Log4j2 (JDBC)

2020-05-25 Thread Walter Underwood
g line. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 25, 2020, at 3:53 AM, Krönert Florian > wrote: > > Hi everyone, > > For our Solr instance I have the requirement that all queries should be > logged, so that we

Re: hl.preserveMulti in Unified highlighter?

2020-05-23 Thread Walter Underwood
I’m a little amused that this thread has become active after almost two months of silence. I think we just used the old highlighter. I don’t even remember now. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 23, 2020, at 9:14 AM, Anthony Gro

Re: Upgrade 5.5.5 to 8.5.1 / Segment stucked in lucene v6

2020-05-19 Thread Walter Underwood
they have no searchable text. After adding all those, run optimize. This should rewrite all the segments in the new format. Finally, delete all the extra documents. Might want to do another optimize after that. No guarantee that this desperate hack will work. wunder Walter Underwood wun

Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
Right. I might use NLP to pull out noun phrases and entities. Entities are essentially noun phrases with proper nouns. Put those in a separate field and build the word cloud from that. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 15, 2

Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
up entirely of stop words. Remove them and it is impossible to search for that phrase. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 14, 2020, at 10:47 PM, A Adel wrote: > > Hi - Is there a way to configure stop words to be dynamic

Terraform and EC2

2020-05-14 Thread Walter Underwood
Anybody building sharded clusters with Terraform on EC2? I’d love some hints. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: solr suggest is not replicated

2020-05-10 Thread Walter Underwood
every shard. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 9, 2020, at 11:55 PM, morph3u...@web.de wrote: > > Hello, > > I want to use solr suggest > (https://lucene.apache.org/solr/guide/8_2/suggester.html) in a solr cloud

Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Walter Underwood
The Porter/Snowball stemmer is an evolved version of a forty year old hack. It is neat that it works at all, but don’t expect too much. I think it is too aggressive for search use. What does KStem do with this? That is based on better linguistic models. wunder Walter Underwood wun

Re: Which Solr metrics do you find important?

2020-04-28 Thread Walter Underwood
IO, etc. CloudWatch for load balancer traffic, errors, and healthy host count. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 28, 2020, at 8:00 AM, matthew sporleder wrote: > > I think clusterstatus is how you find some of that stuf

Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
a list of words that are assumed to be common and less useful, let the engine actually measure how common the words are and factor that into the relevance. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 24, 2020, at 5:39 PM, Steven White wr

Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
I’m astonished that the default still has that. It was a bad idea in Solr 1.3, when it bit my ass. We help people with this about once a month and the advice is always the same. Imagine all the poor people who never ask about it and run with that default! wunder Walter Underwood wun

Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
stopwords in the index. Removing stop words is a desperate speed hack from the days of 16-bit machines. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 24, 2020, at 5:44 AM, David Hastings > wrote: > > you should never use

Re: using S3 as the Directory for Solr

2020-04-23 Thread Walter Underwood
solution. I’d use Apache Hive, or whatever has replaced it. That is what Facebook wrote to do searches on their multi-petabyte logs. https://hive.apache.org More options. https://jethro.io/hadoop-hive https://mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details/ wunder Walter Underwood wun

Re: Fuzzy search not working

2020-04-13 Thread Walter Underwood
You need to add three letters to “prob” to get “problem”, so it is edit distance 3. Fuzzy only works to distance 2. If you want to match prefixes, edge n-grams are a better approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 13, 2
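The edit-distance arithmetic in this reply can be checked with a short sketch. The function below is a plain Levenshtein distance, not Lucene's fuzzy-matching code, and `edge_ngrams` only illustrates the idea of indexing prefixes; it is not the Solr filter itself. The limit of 2 is the one stated in the message, and the n-gram size bounds are assumptions chosen for the example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edge_ngrams(term: str, min_size: int = 2, max_size: int = 10) -> List[str]:
    """Prefixes of a term, the tokens an edge n-gram field would hold (illustrative)."""
    return [term[:n] for n in range(min_size, min(max_size, len(term)) + 1)]


MAX_FUZZY_EDITS = 2  # limit stated in the message above

distance = edit_distance("prob", "problem")
print(distance, "within fuzzy range" if distance <= MAX_FUZZY_EDITS else "too far for fuzzy")
print("prob" in edge_ngrams("problem"))  # a prefix lookup is an exact term match
```

With prefixes indexed as terms, a query for “prob” needs no fuzzy matching at all.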

Re: handling stopwords for special scenarios

2020-04-09 Thread Walter Underwood
Agreed, leave the stopwords alone. I ran into this same problem thirteen years ago at Netflix. Even before that, I wasn’t removing stopwords, but I accidentally left them in the Solr 1.3 config. https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/ wunder Walter Underwood

Re: Solrcloud 7.6 OOM due to unable to create native threads

2020-04-01 Thread Walter Underwood
using a Java program to load those, but I just wrote a multi-threaded Python thingy that uses the JSON update handlers. That is pretty simple code. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 31, 2020, at 11:19 PM, S G wrote: >

Re: Scoring partial match in title field higher than exact match in description field

2020-03-31 Thread Walter Underwood
^4 name_ngram^2 infotext name^8 name_ngram^4 infotext^2 Get rid of: * StopFilterFactory * SynonymFilterFactory * WordDelimiterFilterFactory With the remaining filters, you’ll never have duplicates, so you can also get rid of RemoveDuplicatesTokenFilterFactory if you want. wunder Walter Underwood

Re: Weird issues when using synonyms and stopwords together

2020-03-20 Thread Walter Underwood
is a proportional weighting of common words based on the statistics of your documents. Do not remove stopwords. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 20, 2020, at 7:52 AM, Vikas Kumar wrote: > > I have a field title in my so

Re: How do *you* restrict access to Solr?

2020-03-16 Thread Walter Underwood
. This Gist shows how. https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3 wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 16, 2020, at 8:20 AM, David Hastings > wrote: > > master slave is the idea that you have

Re: How do *you* restrict access to Solr?

2020-03-16 Thread Walter Underwood
What access do you want to prevent? How do you prefer to authenticate? How do you manage users or roles? Master/slave or Solr Cloud? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 16, 2020, at 7:44 AM, Ryan W wrote: > > How do you, p

Re: Problem with Solr 7.7.2 after OOM

2020-03-05 Thread Walter Underwood
> On Mar 5, 2020, at 4:29 AM, Bunde Torsten wrote: > > -Xms512m -Xmx512m Your heap is too small. Set this to -Xms8g -Xmx8g In solr.in.sh, that looks like this: SOLR_HEAP=8g wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Custom update processor and race condition with concurrent requests

2020-03-04 Thread Walter Underwood
This really, really looks like something that should be done with a database, not with Solr. This assumes a transactional model, which Solr doesn’t have. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 3, 2020, at 7:56 PM, Sachin Divekar wr

Re: Why does Solr sort on _docid_ with rows=0 ?

2020-02-28 Thread Walter Underwood
docid is the natural order of the posting lists, so there is no sorting effort. I expect that means “don’t sort”. Also, cross-posting is probably not good. I’m replying only to solr-user. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 28, 2

Re: Time out problems with the Solr server 8.4.1

2020-02-26 Thread Walter Underwood
Many years ago, I accidentally ran Solr with the data dir on an NFS volume. It was 100X slower. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 26, 2020, at 2:42 PM, Vincenzo D'Amore wrote: > > Hi Massimiliano, > > it’s not

Re: How to check for uncommitted changes

2020-02-26 Thread Walter Underwood
There is a “docsPending” value in Solr metrics. It is probably available through JMX. You can get to it through the admin UI, too. Choose a replica, then look at Plugins/Stats, then Update, then updateHandler. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my

Re: Query Autocomplete Evaluation

2020-02-24 Thread Walter Underwood
based on lexicon of book titles is highly effective for us. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 24, 2020, at 9:52 PM, Paras Lehana wrote: > > Hey Audrey, > > I assume MRR is about the ranking of the inte

Re: Best Practises around relevance tuning per query

2020-02-18 Thread Walter Underwood
that category and run a second query using the category scores. 4. Pre-calculate the top 50 results for each category with the slow algorithm and use the elevate component to force that ranking for that term. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 17, 2020, at 10:53 AM, David Hastings > wrote: > > interesting, i cant seem to find anything on Phrase IDF, dont suppose you > have a link or two i could look at by chance? > > On Mon, Feb 17, 2

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
At Infoseek, we used “glue words” to build phrase tokens. It was really effective. Phrase IDF is powerful stuff. Luckily for you, the patent on that has expired. :-) wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 17, 2020, at 10:46 AM, Da

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Why are you using stopwords? I would need a really, really good reason to use those. Stopwords are an obsolete technique from 16-bit processors. I’ve never used them and I’ve been a search engineer since 1997. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my

Re: Outdated information on JVM heap sizes in Solr 8.3 documentation?

2020-02-14 Thread Walter Underwood
garbage collection. That is the only way to have no pauses with automatic memory management. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 14, 2020, at 11:35 AM, Tom Burton-West wrote: > > Hello, > > In the section on JVM tuning i

Re: Replica is going into recovery in Solr 6.1.0

2020-02-14 Thread Walter Underwood
into RAM. This should make a huge speed difference. You’ll also see GC pauses of 200 ms or less. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 13, 2020, at 9:40 PM, vishal patel > wrote: > > Total memory of server is 256 GB and in

Re: Replica is going into recovery in Solr 6.1.0

2020-02-13 Thread Walter Underwood
\ -XX:+UseLargePages \ -XX:+AggressiveOpts \ " If you don’t have a very, very good reason for your GC settings, use these instead. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 12, 2020, at 10:47 PM, vishal patel > wrote: > >

Re: wildcards match end-of-word?

2020-02-13 Thread Walter Underwood
are a slow and imprecise way to search. There is almost always a better way. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 13, 2020, at 1:03 AM, Sotiris Fragkiskos wrote: > > Hi Erick, > thanks very much for this information, it w

Re: Replica is going into recovery in Solr 6.1.0

2020-02-12 Thread Walter Underwood
be something else. What GC are you using? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 12, 2020, at 8:16 PM, vishal patel > wrote: > > Is there anyone looking at this? > > Sent from Outlook<htt

Re: wildcards match end-of-word?

2020-02-11 Thread Walter Underwood
“kinase*” does match “kinase”. On the page you linked to, it defines “*” as matching “Multiple characters (matches zero or more sequential characters)”. If it is not matching, you may be using a stemmer on that field or doing some other processing that changes the tokens. wunder W

Re: cursorMark and shards? (6.6.2)

2020-02-11 Thread Walter Underwood
QTime=379 wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2020, at 6:28 AM, Erick Erickson wrote: > > Wow, that’s pretty horrible performance. > > Yeah, I was conflating a couple of things here. Now it’s clear. > >

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Walter Underwood
sort=“id asc” wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 10, 2020, at 9:50 PM, Tim Casey wrote: > > Walter, > > When you do the query, what is the sort of the results? > > tim > > On Mon, Feb 10, 2020

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Walter Underwood
searching id:0* through id:f*, fetching 1000 rows each time, using cursorMark and distributed search. Median response time is 10 s. CPU usage is about 1%. It is all pretty grubby and it seems like there could be a better way. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org
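The approach described here, one prefix query per leading hex digit with a cursor inside each, can be sketched as follows. This is an assumption-laden outline, not the script from the thread: `fetch` is a hypothetical callable that sends the parameters to Solr and returns the parsed JSON response, and the loop relies on the documented cursor behaviour of starting at `*` and stopping when `nextCursorMark` comes back unchanged.

```python
from typing import Callable, Dict, Iterator, List

# One query per leading hex digit of the UUID, id:0* through id:f*.
PREFIX_QUERIES: List[str] = [f"id:{c}*" for c in "0123456789abcdef"]


def iter_cursor(fetch: Callable[[Dict[str, str]], dict],
                query: str,
                rows: int = 1000,
                sort: str = "id asc") -> Iterator[dict]:
    """Yield every document for one query, following cursorMark until it stops moving.

    `fetch` is a hypothetical helper: it takes request parameters and returns the
    parsed JSON response. The sort must include the unique key field.
    """
    cursor = "*"
    while True:
        response = fetch({"q": query, "rows": str(rows),
                          "sort": sort, "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            return
        cursor = next_cursor


def iter_all(fetch: Callable[[Dict[str, str]], dict]) -> Iterator[dict]:
    """Walk the whole collection prefix by prefix."""
    for query in PREFIX_QUERIES:
        yield from iter_cursor(fetch, query)
```

Because each prefix query is independent, the sixteen walks could also run in separate threads, at the cost of more load on the cluster.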

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Walter Underwood
> On Feb 10, 2020, at 2:24 PM, Walter Underwood wrote: > > Not sure if range queries work on a UUID field, ... A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f. wunder Walter Underwood wun...@wunderwood

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Walter Underwood
Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 10, 2020, at 2:19 PM, Erick Erickson wrote: > > Not sure whether cursormark respects distrib=false, although I can easily see > there being “complications” here. > > Hmmm, whenever I

cursorMark and shards? (6.6.2)

2020-02-10 Thread Walter Underwood
with a single thread and distributed search. Should have followed the old Kernighan and Plauger rule, “Make it right before you make it faster.” wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Checking in on Solr Progress

2020-02-07 Thread Walter Underwood
I wrote some Python that checks CLUSTERSTATUS and reports replica status to Telegraf. Great for charts and alerts, but it only shows status, not progress. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 7, 2020, at 7:58 AM, Erick Erickson wr
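A check of this kind might look roughly like the sketch below. It is not the script mentioned in the message: it only tallies replica states from an already-parsed CLUSTERSTATUS response, and fetching the response and reporting to Telegraf are left out. The sample data is made up for illustration.

```python
from collections import Counter


def replica_state_counts(cluster_status: dict) -> Counter:
    """Count replicas by state across all collections in a CLUSTERSTATUS response."""
    counts: Counter = Counter()
    for collection in cluster_status["cluster"]["collections"].values():
        for shard in collection["shards"].values():
            for replica in shard["replicas"].values():
                counts[replica["state"]] += 1
    return counts


# Made-up sample shaped like a CLUSTERSTATUS response, trimmed to the fields used.
sample = {
    "cluster": {
        "collections": {
            "books": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"state": "active"},
                        "core_node2": {"state": "recovering"},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"state": "active"},
                    }},
                }
            }
        }
    }
}

print(dict(replica_state_counts(sample)))
```

As the message notes, counts like these show status at one moment, not how far a recovery has progressed.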

Re: JSON from Term Vectors Component

2020-02-06 Thread Walter Underwood
working group. That is still a solid spec. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 6, 2020, at 8:56 AM, Doug Turnbull > wrote: > > Well that is interesting, I did not know that! Thanks Walter... > > https://stackover

Re: JSON from Term Vectors Component

2020-02-06 Thread Walter Underwood
Repeated keys are quite legal in JSON, but many libraries don’t support that. It does look like that data layout could be redesigned to be more portable. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 6, 2020, at 8:38 AM, Doug Turnb
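The library behaviour described here is easy to see with Python's standard `json` module, used below only as one example of a parser. By default a repeated key silently keeps the last value; passing `object_pairs_hook` keeps every pair. The sample document is invented for the demonstration and is not actual Term Vectors Component output.

```python
import json

raw = '{"term": 1, "term": 2}'

# Default: the result is a dict, so only the last value for "term" survives.
as_dict = json.loads(raw)

# object_pairs_hook receives all key/value pairs in document order,
# so repeated keys are preserved as a list of tuples.
as_pairs = json.loads(raw, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```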

Re: How to compute index size

2020-02-03 Thread Walter Underwood
by with the smallest possible RAM or disk. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 3, 2020, at 5:28 AM, Erick Erickson wrote: > > I’ve always had trouble with that advice, that RAM size should be JVM + index > size. I’ve seen

Re: Solr 7.7 heap space is getting full

2020-02-02 Thread Walter Underwood
be part of a faceted search system. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 2, 2020, at 12:36 PM, Erick Erickson wrote: > > Mostly I was reacting to the statement that the number > of docs increased by over 4x and the

Re: Solr 7.7 heap space is getting full

2020-02-02 Thread Walter Underwood
updates also don’t need extra RAM. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 2, 2020, at 7:52 AM, Rajdeep Sahoo wrote: > > We have allocated 16 gb of heap space out of 24 g. > There are 3 solr cores here, for o

Re: G1GC Pauses (Young Gen)

2020-02-02 Thread Walter Underwood
. Does your system have 70+ GB of RAM? If not, a smaller heap means you can keep more of the index in file buffers. That will make things faster. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 2, 2020, at 1:01 AM, Karl Stoney > wrote:

Re: Solr 7.7 heap space is getting full

2020-02-01 Thread Walter Underwood
What message do you get about the heap space? It is completely normal for Java to use all of heap before running a major GC. That is how the JVM works. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 1, 2020, at 6:35 AM, Rajdeep Sahoo wr

Re: Oracle OpenJDK to Amazon Corretto OpenJDK

2020-01-31 Thread Walter Underwood
Maybe you can give them an estimate of how much work it will be. See if legal will put it on their budget. Free software isn’t free, especially the “free kittens” kind. This guy offers consulting for custom Docker images. https://pythonspeed.com/about/ wunder Walter Underwood wun

Re: How expensive is core loading?

2020-01-29 Thread Walter Underwood
You might use Luke to get that info from the index files without loading them into Solr. https://code.google.com/archive/p/luke/ wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 29, 2020, at 2:01 PM, Rahul Goswami wrote: > > Hello, >

Re: Solr Searcher 100% Latency Spike

2020-01-29 Thread Walter Underwood
Looking at the log, that takes one or two seconds after a complete batch reload (master/slave). So that is loading a cold index, all new files. This is not a big index, about a half million book titles. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog

Re: Solr Searcher 100% Latency Spike

2020-01-29 Thread Walter Underwood
criminology developmental engineering wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 29, 2020, at 1:01 PM, Shawn Heisey wrote: > > On 1/29/2020 12:44 PM, Karl Stoney wrote: >> Looking for a

Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Walter Underwood
title, so “Managerial Accounting: Student Value Edition” becomes just “Managerial Accounting”. Showing all the variations is the job of the real results page. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 24, 2020, at 7:07 AM, Lucky Sha

Re: Solr 7.7 heap space is getting full

2020-01-19 Thread Walter Underwood
What message do you get that means the heap space is full? Java will always use all of the heap, either as live data or not-yet-collected garbage. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 19, 2020, at 5:47 PM, Rajdeep Sahoo wr

Re: Solr 7.7 heap space is getting full

2020-01-19 Thread Walter Underwood
question, how frequently is the index updated? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 19, 2020, at 4:49 PM, Rajdeep Sahoo wrote: > > Hi, > Still facing the same issue... > Anything else that we need to check. > > > On

Re: Solr 7.7 heap space is getting full

2020-01-19 Thread Walter Underwood
abled \ -XX:G1HeapRegionSize=8m \ -XX:MaxGCPauseMillis=200 \ -XX:+UseLargePages \ -XX:+AggressiveOpts \ " wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 19, 2020, at 9:25 AM, Rajdeep Sahoo wrote: > > We are using solr 7.7 . Ram size is 24 gb

Re: Solr cloud production set up

2020-01-18 Thread Walter Underwood
Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 18, 2020, at 9:29 AM, Rajdeep Sahoo wrote: > > Hi shawn, > Thanks for this info, > Could you Please address my below query, > > > We are having 2.3 million documents and size is 2.5 gb

Re: Solr cloud production set up

2020-01-18 Thread Walter Underwood
How big? We index 35 million documents in about 6 hours. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 18, 2020, at 12:05 AM, Rajdeep Sahoo > wrote: > > Our Index size is huge and in master slave the full indexing time is alm

Re: Solr cloud production set up

2020-01-17 Thread Walter Underwood
Why do you want to change to Solr Cloud? Master/slave is a great, stable cluster architecture. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 17, 2020, at 6:19 PM, Rajdeep Sahoo wrote: > > Please reply anyone > > On Sat, 18 J

Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-15 Thread Walter Underwood
Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 15, 2020, at 3:42 AM, Dc Tech wrote: > > Thank you Jan and Charlie. > > I should say that in terms of posting to the community regarding Elastic vs > Solr - this is probably the most

Re: Search phrase not parsed properly

2020-01-10 Thread Walter Underwood
Remove ALL the stopwords. Remove the stopword filter. This will happen again and again with different words until you do that. Stopwords were necessary with 16-bit CPUs. I stopped using them in 1996. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog

Re: Add Solr files to VCS (GIT)

2020-01-09 Thread Walter Underwood
If you only have one server, that isn’t production or search isn’t important. So it doesn’t really matter how you update it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 9, 2020, at 7:48 AM, Paras Lehana wrote: > > Hey Erick,

Re: Add Solr files to VCS (GIT)

2020-01-09 Thread Walter Underwood
For master/slave clusters, we have a deploy step that copies the config files to each server. Then we restart the Solr process. We do that one at a time for minimal service interruption. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 9, 2

Re: support need in solr for min and max

2020-01-08 Thread Walter Underwood
I hope you do not plan to use Solr as a primary repository. Solr is NOT a database. If you use Solr as a database, you will lose data at some point. The Solr feature set is very different from MySQL. There is no guarantee that a SQL query can be translated into a Solr query. wunder Walter

Re: Boosting only top n results that match a criteria

2019-12-27 Thread Walter Underwood
You could use two queries. Do the first with rows=5, then for the second use an fq that filters out the IDs of those five. You’ll need to do the first query again to do the second and further page of results statelessly, but that should still be pretty fast. wunder Walter Underwood wun
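The two-query idea can be sketched as parameter building. Everything named here is an assumption for illustration: the unique key is taken to be `id`, the helper names are hypothetical, and sending the requests is left out. The first query runs with `rows=5`; its IDs are then excluded from the second query with a negative filter.

```python
from typing import Dict, Iterable


def exclude_ids_fq(ids: Iterable[str], id_field: str = "id") -> str:
    """Build a negative filter query that drops the given document IDs."""
    quoted = " ".join(
        '"' + i.replace("\\", "\\\\").replace('"', '\\"') + '"' for i in ids
    )
    return f"-{id_field}:({quoted})"


def second_query_params(q: str, top_ids: Iterable[str],
                        rows: int = 10, start: int = 0) -> Dict[str, str]:
    """Parameters for the follow-up query, with the boosted top results filtered out."""
    return {"q": q, "fq": exclude_ids_fq(top_ids),
            "rows": str(rows), "start": str(start)}


# IDs returned by the first query (rows=5); these values are made up.
top_ids = ["b101", "b205", "b317", "b420", "b588"]
print(second_query_params("title:accounting", top_ids))
```

Staying stateless means repeating the first query on every page request to recover the same five IDs, which is the extra cost the message mentions.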

Solr Cloud on Docker?

2019-12-13 Thread Walter Underwood
. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: native Thread - solr 8.2.0

2019-12-09 Thread Walter Underwood
in the late 1980s. https://www.researchgate.net/publication/224734039_On_Packet_Switches_with_Infinite_Storage wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 9, 2019, at 11:14 PM, Mikhail Khludnev wrote: > > My experience with "Ou

Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Walter Underwood
The best approach is to not use stop words at all. That gives better relevance with less configuration, so it is a total win. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 2, 2019, at 12:24 PM, Jörn Franke wrote: > > You can have

Re: A Last Message to the Solr Users

2019-11-27 Thread Walter Underwood
all the Lucene syntax in queries? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 27, 2019, at 8:37 AM, Mark Miller wrote: > > If SolrCloud worked well I’d still agree both options are very valid > depending on your use case. A

Re: Prevent Solr overwriting documents

2019-11-27 Thread Walter Underwood
That would be “do-not-overwrite”. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 27, 2019, at 4:38 PM, Walter Underwood wrote: > > Even if that works, it is evil as something to leave in a client codebase. > Maybe a do-no-overwrit

Re: Prevent Solr overwriting documents

2019-11-27 Thread Walter Underwood
Even if that works, it is evil as something to leave in a client codebase. Maybe a do-no-overwrite flag would be useful. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 27, 2019, at 3:24 PM, Alexandre Rafalovitch wrote: > >

Re: Zk upconfig command is appending local directory to default confdir

2019-11-19 Thread Walter Underwood
I found the zk uploading stuff to be under-documented. Plus, it requires installing Solr on the deployment machine. So I used the Python kazoo package and wrote my own uploader. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 19, 2019, at 5

Re: using fq means no results

2019-11-12 Thread Walter Underwood
I explain it this way: * fq: filtering * q: filtering and scoring * bq: scoring wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 12, 2019, at 9:08 AM, Erik Hatcher wrote: > > > >> On Nov 12, 2019, at 12:01 PM, rhys J wr
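The three roles can be put side by side in one request. The field names and values below are hypothetical, and `defType=edismax` is included because `bq` belongs to the dismax family of query parsers. This only builds the query string; it does not contact a server.

```python
from urllib.parse import urlencode

params = {
    "defType": "edismax",          # bq is a dismax/edismax parameter
    "q": "managerial accounting",  # filtering and scoring
    "fq": "format:book",           # filtering only, no effect on the score
    "bq": "edition:current^2",     # scoring only, excludes nothing
}

query_string = urlencode(params)
print(query_string)
```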

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
If we had IDF for phrases, they would be super effective. The 2X weight is a hack that mostly works. Infoseek had phrase IDF and it was a killer algorithm for relevance. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 8, 2019, at 11:08

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
A”, and that shows up in a query, that term can be queried against the field matching that vocabulary. This is how LinkedIn separates people, companies, and places, for example. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 8, 2019, at 10:48

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
But when you change it to AND, a single misspelling means zero results. That is usually not helpful. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 8, 2019, at 10:43 AM, David Hastings > wrote: > > is your default operator

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
herited that implementation and I am really keen to adequate it, what > would you recommend ? > > Cheers > Guilherme > >> On 7 Nov 2019, at 14:43, Walter Underwood wrote: >> >> Thanks for posting the files. Looking at schema.xml, I see that you

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 7, 2019, at 6:56 AM, Guilherme Viteri wrote: >

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Walter Underwood
. Reindex all of the documents. When indexed with the new analysis chain, the stopwords will not be removed and they will be searchable. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 5, 2019, at 8:56 AM, Guilherme Viteri wrote: > >

Re: Delete documents from the Solr index using SolrJ

2019-11-04 Thread Walter Underwood
If it is the same document, why are you changing the ID? Use the same ID and you are done. You won’t need to delete previous versions. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 4, 2019, at 8:37 AM, Khare, Kushal (MIND) >

Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
the collection gets frequent updates and is getting limited public traffic. That will change on Monday. Make sure that you have more free RAM than the size of the index. Allow for the size of the JVM, OS, etc. Make sure you have plenty of CPU. After you have the RAM, CPU is the bottleneck. wunder Walter

Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
of as a relevance term. This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek always beat Google in relevance tests, probably because of phrase IDF. More Like This could do the same thing, but it seems to be really slow and not especially useful as a search component. wunder Walter

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
years ago, I hit several movie or TV titles which were all stopwords. I wrote about them in this blog post. https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/ wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 9, 2019, at 6

Re: Solr standalone timeouts after upgrading to SOLR 7

2019-10-03 Thread Walter Underwood
Just set Xms and Xmx the same. The server will be running for weeks, so allocate the memory and get on with it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 3, 2019, at 11:38 AM, ndra wrote: > >> I don’t think having the initial

Re: Solr standalone timeouts after upgrading to SOLR 7

2019-10-03 Thread Walter Underwood
to the long-lived space. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 3, 2019, at 10:11 AM, ndra wrote: > >> When the heap is out of free space that >> can be recovered with minor GC, the JVM will increase the size if possible.

Re: Solr standalone timeouts after upgrading to SOLR 7

2019-10-03 Thread Walter Underwood
GC -- wunder 2017-01-23 # Settings from https://wiki.apache.org/solr/ShawnHeisey GC_TUNE=" \ -XX:+UseG1GC \ -XX:+ParallelRefProcEnabled \ -XX:G1HeapRegionSize=8m \ -XX:MaxGCPauseMillis=200 \ -XX:+UseLargePages \ -XX:+AggressiveOpts \ " wunder Walter Underwood wun...@wunderwood.org http://o

Re: Solr standalone timeouts after upgrading to SOLR 7

2019-10-03 Thread Walter Underwood
Always make Xmx and Xms the same. The heap will increase to the max before a major GC, so avoid the pauses to grow it. Use the G1 collector. CMS is really obsolete. We’ve had G1 in prod for at least three years. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my

Re: Solrcloud export all results sorted by score

2019-10-01 Thread Walter Underwood
while I wasn’t looking) in one request. I also have a hairy shell script to do /export on each leader after parsing cluster status. That might be a little large to post to this list, but I can do it if there is general interest. wunder Walter Underwood wun...@wunderwood.org http

Re: Throughput does not increase in spite of low CPU usage

2019-09-30 Thread Walter Underwood
31G is still a very large heap. We use 8G for all of our different clusters. Do you have JVM monitoring? Look at the heap used after a major GC. Use that number, plus some extra, for the heap size. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog
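A minimal sketch of the sizing rule in the message above: take the heap used after major GCs and add some extra. The function name, the sample measurements, and the 50% headroom are illustrative assumptions. The message only says "plus some extra".

```python
def suggest_heap_mb(post_major_gc_used_mb, headroom_fraction=0.5):
    """Suggest a heap size from heap-used readings taken right after major GCs.

    post_major_gc_used_mb: heap used (MB) after each observed major GC.
    headroom_fraction: the "some extra" on top; 0.5 is an assumed value,
    not a figure from the thread.
    """
    if not post_major_gc_used_mb:
        raise ValueError("need at least one post-GC measurement")
    # Size for the worst case seen, since that is the live data the heap must hold.
    baseline = max(post_major_gc_used_mb)
    return int(baseline * (1 + headroom_fraction))


# Hypothetical readings taken from JVM monitoring.
suggested = suggest_heap_mb([3000, 4000, 3500])
```

The result would then be used for both Xms and Xmx, following the advice elsewhere in these threads to set them the same.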

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Walter Underwood
hat behavior. doc1: glück doc1 terms: glück, gluck, glueck doc2: glueck doc2 terms: glueck df for glück: 1 df for gluck: 1 df for glueck: 2 The df for the term “glück” is the same whether you expand or not. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
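The document-frequency counts in the message above can be checked with a small sketch. The expansion table and function name are assumptions made for illustration; this is not Solr's analysis chain, only the counting argument.

```python
from collections import Counter

# Hypothetical index-time expansion table (an assumption for illustration):
# each surface form maps to the terms indexed for it.
EXPANSIONS = {
    "glück": ["glück", "gluck", "glueck"],
    "glueck": ["glueck"],
}


def document_frequency(docs, expand=True):
    """Count how many documents contain each indexed term."""
    df = Counter()
    for text in docs.values():
        terms = set()
        for token in text.split():
            if expand:
                terms.update(EXPANSIONS.get(token, [token]))
            else:
                terms.add(token)
        # Each document contributes at most 1 to a term's df.
        df.update(terms)
    return df


docs = {"doc1": "glück", "doc2": "glueck"}
expanded = document_frequency(docs, expand=True)
plain = document_frequency(docs, expand=False)
```

With these two documents, the df for "glück" is 1 in both `expanded` and `plain`, which is the point the message makes.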

Re: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-04 Thread Walter Underwood
am needs the width of every character in the current font. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Walter Underwood
but at least there is a match. coöperation cooperation cooepoeration (typewriter umlaut version) wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Walter Underwood
ulture/culture-desk/the-curse-of-the-diaeresis In German, there are corner cases where just stripping the umlaut changes one word into another, like schön/schon. Isn’t language fun? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Aug 30, 2019, at

Re: Solr is very slow with term vectors

2019-08-16 Thread Walter Underwood
First, time fetching one million records with all the fields you need, both for display and for re-ranking. If that is slow, then no amount of cosine code tweaking will make it fast. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Aug 16, 2

Re: Solr is very slow with term vectors

2019-08-11 Thread Walter Underwood
. As Kernighan and Plauger said in 1978, “Don’t diddle code to make it faster—find a better algorithm.” https://en.wikipedia.org/wiki/The_Elements_of_Programming_Style wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Aug 11, 2019, at 10:40

Re: API to get all the solr nodes (active/down)

2019-07-16 Thread Walter Underwood
?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.collections[].shards[].replicas[].node_name' | sort -u` do echo $host done wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jul 15, 2019, at 10:55 PM, Fatima Khan wrote: > > Hi All, >I am working
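For anyone without jq, the same traversal as the filter in the message above can be sketched in Python. The sample response below is a made-up fragment with hypothetical host names. It assumes the CLUSTERSTATUS JSON has the cluster / collections / shards / replicas / node_name nesting that the jq path walks.

```python
import json


def node_names(cluster_status):
    """Return the sorted, de-duplicated node_name of every replica.

    Mirrors: .cluster.collections[].shards[].replicas[].node_name | sort -u
    """
    names = set()
    for collection in cluster_status["cluster"]["collections"].values():
        for shard in collection["shards"].values():
            for replica in shard["replicas"].values():
                names.add(replica["node_name"])
    return sorted(names)


# Hypothetical response fragment; real output carries many more fields.
SAMPLE = json.loads("""
{"cluster": {"collections": {"books": {"shards": {
  "shard1": {"replicas": {
    "core_node1": {"node_name": "solr-b.example.com:8983_solr"},
    "core_node2": {"node_name": "solr-a.example.com:8983_solr"}}},
  "shard2": {"replicas": {
    "core_node3": {"node_name": "solr-a.example.com:8983_solr"}}}}}}}}
""")

hosts = node_names(SAMPLE)
```

Like the shell loop, this only lists nodes that host at least one replica.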

Re: Hardware requirements to host Apache Solr application

2019-07-15 Thread Walter Underwood
One of our clusters got as large as 40 c4.8xlarge, another is happy with 4 m4.xlarge and could probably handle the load with one of them. It depends on the number of documents, query load, types of queries, frequency of updates, all sorts of things. wunder Walter Underwood wun

Re: Identify stopwords using TF-IDF

2019-06-22 Thread Walter Underwood
I haven’t removed stopwords since 1996, when I joined Infoseek. What is your special case where you must remove them? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jun 22, 2019, at 9:51 PM, akash jayaweera > wrote: > > Hello Walter
