g line.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 25, 2020, at 3:53 AM, Krönert Florian
> wrote:
>
> Hi everyone,
>
> For our Solr instance I have the requirement that all queries should be
> logged, so that we
I’m a little amused that this thread has become active after almost two months
of silence.
I think we just used the old highlighter. I don’t even remember now.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 23, 2020, at 9:14 AM, Anthony Gro
they
have no searchable text.
After adding all those, run optimize. This should rewrite all the segments in
the new format.
Finally, delete all the extra documents. Might want to do another optimize
after that.
No guarantee that this desperate hack will work.
wunder
Walter Underwood
wun
Right. I might use NLP to pull out noun phrases and entities. Entities are
essentially noun phrases with proper nouns.
Put those in a separate field and build the word cloud from that.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 15, 2
up entirely of
stop words. Remove them and it is impossible to search for that phrase.
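A minimal sketch of that failure, assuming a small hypothetical stop list and a phrase picked only for illustration (neither comes from the thread):

```python
# Hypothetical stop list; real stop lists vary, but most contain these words.
STOPWORDS = {"a", "an", "and", "be", "is", "not", "or", "the", "to"}


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop every token in STOPWORDS."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


# Every token of this phrase is a stop word, so nothing is left to search for.
print(remove_stopwords("To be or not to be"))
# A title with one content word still loses its article.
print(remove_stopwords("The Matrix"))
```

With the stop filter in the analysis chain, the first phrase produces no tokens at all, so no query can match it.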
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 14, 2020, at 10:47 PM, A Adel wrote:
>
> Hi - Is there a way to configure stop words to be dynamic
Anybody building sharded clusters with Terraform on EC2? I’d love some hints.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
every shard.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On May 9, 2020, at 11:55 PM, morph3u...@web.de wrote:
>
> Hello,
>
> I want to use solr suggest
> (https://lucene.apache.org/solr/guide/8_2/suggester.html) in a solr cloud
The Porter/Snowball stemmer is an evolved version of a forty-year-old hack.
It is neat that it works at all, but don’t expect too much. I think it is too
aggressive
for search use.
What does KStem do with this? That is based on better linguistic models.
wunder
Walter Underwood
wun
IO, etc.
CloudWatch for load balancer traffic, errors, and healthy host count.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Apr 28, 2020, at 8:00 AM, matthew sporleder wrote:
>
> I think clusterstatus is how you find some of that stuf
a list of words that are assumed to be common and less
useful, let the engine actually measure how common the words are and factor
that into the relevance.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Apr 24, 2020, at 5:39 PM, Steven White wr
I’m astonished that the default still has that. It was a bad idea in Solr 1.3,
when
it bit my ass.
We help people with this about once a month and the advice is always the same.
Imagine all the poor people who never ask about it and run with that default!
wunder
Walter Underwood
wun
stopwords in the index. Removing stop words is a desperate
speed/space hack from the days of 16-bit machines.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Apr 24, 2020, at 5:44 AM, David Hastings
> wrote:
>
> you should never use
solution.
I’d use Apache Hive, or whatever has replaced it. That is what Facebook wrote
to do searches on their multi-petabyte logs.
https://hive.apache.org
More options.
https://jethro.io/hadoop-hive
https://mapr.com/why-hadoop/sql-hadoop/sql-hadoop-details/
wunder
Walter Underwood
wun
You need to add three letters to “prob” to get “problem”, so it is edit
distance 3.
Fuzzy only works to distance 2.
If you want to match prefixes, edge n-grams are a better approach.
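A rough sketch of both points, assuming plain Levenshtein distance (Solr's fuzzy matching also counts transpositions, which makes no difference for this pair). The function names are made up for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: each insert, delete, or substitute costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edge_ngrams(term, min_len=1, max_len=10):
    """Prefixes of term, roughly what an edge n-gram filter would index."""
    return [term[:n] for n in range(min_len, min(len(term), max_len) + 1)]


print(edit_distance("prob", "problem"))
print(edge_ngrams("problem", min_len=2))
```

With edge n-grams indexed, "prob" is an exact term match against "problem", so no fuzzy matching is needed.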
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Apr 13, 2
Agreed, leave the stopwords alone. I ran into this same problem
thirteen years ago at Netflix. Even before that, I wasn’t removing
stopwords, but I accidentally left them in the Solr 1.3 config.
https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
wunder
Walter Underwood
using a Java program to load those, but I just wrote a
multi-threaded Python thingy that uses the JSON update handlers.
That is pretty simple code.
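Not the actual script, but a minimal sketch of the idea, assuming a hypothetical host and collection name and using only the standard library. It posts JSON arrays of documents to the update handler from a pool of threads and leaves commits to autoCommit:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL; substitute your own host and collection.
UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch, url=UPDATE_URL):
    """POST one batch as a JSON array to the update handler."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def load(docs, batch_size=500, threads=4, send=post_batch):
    """Send all batches through a pool of worker threads, keeping batch order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(send, batches(docs, batch_size)))
```

The `send` parameter is there so the batching can be tried without a running Solr.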
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 31, 2020, at 11:19 PM, S G wrote:
>
^4 name_ngram^2 infotext
name^8 name_ngram^4 infotext^2
Get rid of:
* StopFilterFactory
* SynonymFilterFactory
* WordDelimiterFilterFactory
With the remaining filters, you’ll never have duplicates, so you can also get
rid of RemoveDuplicatesTokenFilterFactory if you want.
wunder
Walter Underwood
is a proportional weighting of common words based on the statistics of
your documents.
Do not remove stopwords.
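A small sketch of that weighting, assuming a toy corpus invented for illustration and the IDF formula used by Lucene's BM25 similarity:

```python
import math

# Hypothetical toy corpus.
DOCS = [
    "the matrix",
    "the lord of the rings",
    "the return of the king",
    "to be or not to be",
]


def doc_freq(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc.split())


def idf(term, docs):
    """BM25-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""
    n = len(docs)
    df = doc_freq(term, docs)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


for term in ("the", "of", "matrix"):
    print(term, doc_freq(term, DOCS), round(idf(term, DOCS), 3))
```

Common words get a small weight and rare words a large one, measured from the documents themselves instead of guessed from a list.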
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 20, 2020, at 7:52 AM, Vikas Kumar wrote:
>
> I have a field title in my so
. This Gist shows how.
https://gist.github.com/nz/673027/313f70681daa985ea13ba33a385753aef951a0f3
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 16, 2020, at 8:20 AM, David Hastings
> wrote:
>
> master slave is the idea that you have
What access do you want to prevent? How do you prefer to authenticate?
How do you manage users or roles? Master/slave or Solr Cloud?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 16, 2020, at 7:44 AM, Ryan W wrote:
>
> How do you, p
> On Mar 5, 2020, at 4:29 AM, Bunde Torsten wrote:
>
> -Xms512m -Xmx512m
Your heap is too small. Set this to -Xms8g -Xmx8g
In solr.in.sh, that looks like this:
SOLR_HEAP=8g
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
This really, really looks like something that should be done with a
database, not with Solr. This assumes a transactional model, which
Solr doesn’t have.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 3, 2020, at 7:56 PM, Sachin Divekar wr
docid is the natural order of the posting lists, so there is no sorting effort.
I expect that means “don’t sort”.
Also, cross-posting is probably not good. I’m replying only to solr-user.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 28, 2
Many years ago, I accidentally ran Solr with the data dir on an NFS volume.
It was 100X slower.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 26, 2020, at 2:42 PM, Vincenzo D'Amore wrote:
>
> Hi Massimiliano,
>
> it’s not
There is a “docsPending” value in Solr metrics. It is probably available
through JMX. You can get to it through the admin UI, too. Choose a replica,
then look at Plugins/Stats, then Update, then updateHandler.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my
based on lexicon of book
titles is highly effective for us.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 24, 2020, at 9:52 PM, Paras Lehana wrote:
>
> Hey Audrey,
>
> I assume MRR is about the ranking of the inte
that category and run a second
query using
the category scores.
4. Pre-calculate the top 50 results for each category with the slow algorithm
and use the
elevate component to force that ranking for that term.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog
Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 17, 2020, at 10:53 AM, David Hastings
> wrote:
>
> interesting, i cant seem to find anything on Phrase IDF, dont suppose you
> have a link or two i could look at by chance?
>
> On Mon, Feb 17, 2
At Infoseek, we used “glue words” to build phrase tokens. It was really
effective.
Phrase IDF is powerful stuff.
Luckily for you, the patent on that has expired. :-)
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 17, 2020, at 10:46 AM, Da
Why are you using stopwords? I would need a really, really good reason to use
those.
Stopwords are an obsolete technique from 16-bit processors. I’ve never used
them and
I’ve been a search engineer since 1997.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my
garbage collection. That is the only way to have no pauses with
automatic memory management.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 14, 2020, at 11:35 AM, Tom Burton-West wrote:
>
> Hello,
>
> In the section on JVM tuning i
into RAM. This should make a huge
speed difference. You’ll also see GC pauses of 200 ms or less.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 13, 2020, at 9:40 PM, vishal patel
> wrote:
>
> Total memory of server is 256 GB and in
\
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"
If you don’t have a very, very good reason for your GC settings, use these
instead.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 12, 2020, at 10:47 PM, vishal patel
> wrote:
>
>
are a slow
and imprecise way to search. There is almost always a better way.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 13, 2020, at 1:03 AM, Sotiris Fragkiskos wrote:
>
> Hi Erick,
> thanks very much for this information, it w
be something else.
What GC are you using?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 12, 2020, at 8:16 PM, vishal patel
> wrote:
>
> Is there anyone looking at this?
>
> Sent from Outlook<htt
“kinase*” does match “kinase”. On the page you linked to, it defines “*” as
matching “Multiple characters (matches zero or more sequential characters)”.
If it is not matching, you may be using a stemmer on that field or doing some
other processing that changes the tokens.
wunder
W
QTime=379
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 11, 2020, at 6:28 AM, Erick Erickson wrote:
>
> Wow, that’s pretty horrible performance.
>
> Yeah, I was conflating a couple of things here. Now it’s clear.
>
>
sort=“id asc”
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 10, 2020, at 9:50 PM, Tim Casey wrote:
>
> Walter,
>
> When you do the query, what is the sort of the results?
>
> tim
>
> On Mon, Feb 10, 2020
searching id:0* through id:f*, fetching 1000 rows each time, using
cursorMark and distributed search. Median response time is 10 s. CPU usage is
about 1%.
It is all pretty grubby and it seems like there could be a better way.
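For reference, the loop looks roughly like this. It is a sketch, not the production code: `search` is a hypothetical caller-supplied function that takes a params dict and returns the parsed JSON response, and the uniqueKey field is assumed to be `id`.

```python
HEX_PREFIXES = "0123456789abcdef"


def fetch_all(search, rows=1000):
    """Walk id:0* through id:f*, paging each prefix with cursorMark."""
    for prefix in HEX_PREFIXES:
        cursor = "*"  # the starting cursor value
        while True:
            response = search({
                "q": f"id:{prefix}*",
                "rows": rows,
                "sort": "id asc",  # cursorMark needs a sort on the uniqueKey
                "cursorMark": cursor,
            })
            yield from response["response"]["docs"]
            next_cursor = response["nextCursorMark"]
            if next_cursor == cursor:  # unchanged cursor: prefix is exhausted
                break
            cursor = next_cursor
```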
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org
> On Feb 10, 2020, at 2:24 PM, Walter Underwood wrote:
>
> Not sure if range queries work on a UUID field, ...
A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try
separate queries for 0-f.
wunder
Walter Underwood
wun...@wunderwood
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 10, 2020, at 2:19 PM, Erick Erickson wrote:
>
> Not sure whether cursormark respects distrib=false, although I can easily see
> there being “complications” here.
>
> Hmmm, whenever I
with a single thread and distributed search. Should have
followed the old Kernighan and Plauger rule, “Make it right before you make it
faster.”
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
I wrote some Python that checks CLUSTERSTATUS and reports replica status to
Telegraf. Great for charts and alerts, but it only shows status, not progress.
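Not the script itself, but a sketch of the core of it. It assumes the CLUSTERSTATUS response has already been fetched and parsed, and the measurement and tag names are made up for illustration:

```python
from collections import Counter


def replica_states(cluster_status):
    """Count replicas by (collection, state) in a parsed CLUSTERSTATUS response."""
    counts = Counter()
    collections = cluster_status["cluster"]["collections"]
    for name, collection in collections.items():
        for shard in collection["shards"].values():
            for replica in shard["replicas"].values():
                counts[(name, replica["state"])] += 1
    return counts


def to_line_protocol(counts, measurement="solr_replicas"):
    """Format the counts as InfluxDB line protocol, which Telegraf accepts."""
    return [
        f"{measurement},collection={name},state={state} count={n}i"
        for (name, state), n in sorted(counts.items())
    ]
```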
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 7, 2020, at 7:58 AM, Erick Erickson wr
working group. That is still a solid spec.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 6, 2020, at 8:56 AM, Doug Turnbull
> wrote:
>
> Well that is interesting, I did not know that! Thanks Walter...
>
> https://stackover
Repeated keys are quite legal in JSON, but many libraries don’t support that.
It does look like that data layout could be redesigned to be more portable.
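A quick illustration with Python's standard json module, using made-up keys and values. The default decoder silently keeps only the last value, but a pairs hook sees them all:

```python
import json

# Hypothetical payload with a repeated key.
text = '{"fq": "type:book", "fq": "lang:en"}'

# The default decoder builds a dict, so the last value for a key wins.
as_dict = json.loads(text)

# object_pairs_hook receives every key/value pair in document order.
as_pairs = json.loads(text, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```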
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 6, 2020, at 8:38 AM, Doug Turnb
by with the smallest possible RAM or disk.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 3, 2020, at 5:28 AM, Erick Erickson wrote:
>
> I’ve always had trouble with that advice, that RAM size should be JVM + index
> size. I’ve seen
be part of a faceted search system.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 2, 2020, at 12:36 PM, Erick Erickson wrote:
>
> Mostly I was reacting to the statement that the number
> of docs increased by over 4x and the
updates also don’t need extra RAM.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 2, 2020, at 7:52 AM, Rajdeep Sahoo wrote:
>
> We have allocated 16 gb of heap space out of 24 g.
> There are 3 solr cores here, for o
.
Does your system have 70+ GB of RAM? If not, a smaller heap means you can keep
more of the index in file buffers. That will make things faster.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 2, 2020, at 1:01 AM, Karl Stoney
> wrote:
>
What message do you get about the heap space?
It is completely normal for Java to use all of heap before running a major GC.
That
is how the JVM works.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Feb 1, 2020, at 6:35 AM, Rajdeep Sahoo wr
Maybe you can give them an estimate of how much work it will be. See if legal
will put it on their budget. Free software isn’t free, especially the “free
kittens” kind.
This guy offers consulting for custom Docker images.
https://pythonspeed.com/about/
wunder
Walter Underwood
wun
You might use Luke to get that info from the index files without loading them
into Solr.
https://code.google.com/archive/p/luke/
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 29, 2020, at 2:01 PM, Rahul Goswami wrote:
>
> Hello,
>
Looking at the log, that takes one or two seconds after a complete batch reload
(master/slave). So that is loading a cold index, all new files. This is not a
big index, about a half million book titles.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog
criminology
developmental
engineering
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 29, 2020, at 1:01 PM, Shawn Heisey wrote:
>
> On 1/29/2020 12:44 PM, Karl Stoney wrote:
>> Looking for a
title, so “Managerial Accounting:
Student Value Edition”
becomes just “Managerial Accounting”. Showing all the variations is the job of
the
real results page.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 24, 2020, at 7:07 AM, Lucky Sha
What message do you get that means the heap space is full?
Java will always use all of the heap, either as live data or not-yet-collected
garbage.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 19, 2020, at 5:47 PM, Rajdeep Sahoo wr
question, how frequently is the index updated?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 19, 2020, at 4:49 PM, Rajdeep Sahoo wrote:
>
> Hi,
> Still facing the same issue...
> Anything else that we need to check.
>
>
> On
abled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 19, 2020, at 9:25 AM, Rajdeep Sahoo wrote:
>
> We are using solr 7.7 . Ram size is 24 gb
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 18, 2020, at 9:29 AM, Rajdeep Sahoo wrote:
>
> Hi shawn,
> Thanks for this info,
> Could you Please address my below query,
>
>
> We are having 2.3 million documents and size is 2.5 gb
How big? We index 35 million documents in about 6 hours.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 18, 2020, at 12:05 AM, Rajdeep Sahoo
> wrote:
>
> Our Index size is huge and in master slave the full indexing time is alm
Why do you want to change to Solr Cloud? Master/slave is a great, stable
cluster architecture.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 17, 2020, at 6:19 PM, Rajdeep Sahoo wrote:
>
> Please reply anyone
>
> On Sat, 18 J
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 15, 2020, at 3:42 AM, Dc Tech wrote:
>
> Thank you Jan and Charlie.
>
> I should say that in terms of posting to the community regarding Elastic vs
> Solr - this is probably the most
Remove ALL the stopwords. Remove the stopword filter.
This will happen again and again with different words until you do that.
Stopwords were necessary with 16-bit CPUs. I stopped using them in 1996.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog
If you only have one server, that isn’t production or search isn’t important.
So it doesn’t really matter how you update it.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 9, 2020, at 7:48 AM, Paras Lehana wrote:
>
> Hey Erick,
>
For master/slave clusters, we have a deploy step that copies the config files
to each server. Then we restart the Solr process. We do that one at a time for
minimal service interruption.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jan 9, 2
I hope you do not plan to use Solr as a primary repository. Solr is NOT a
database. If you use Solr as a database, you will lose data at some point.
The Solr feature set is very different from MySQL. There is no guarantee that a
SQL query can be translated into a Solr query.
wunder
Walter
You could use two queries. Do the first with rows=5, then for the second use
an fq that filters out the IDs of those five. You’ll need to do the first query
again
to do the second and further page of results statelessly, but that should still
be pretty fast.
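A sketch of the request parameters, assuming the uniqueKey field is `id` and that the IDs contain no double quotes. The function names are hypothetical:

```python
def exclude_ids_fq(ids):
    """Build a filter query that excludes the given document IDs."""
    quoted = " ".join(f'"{doc_id}"' for doc_id in ids)
    return f"-id:({quoted})"


def first_query(q):
    """Parameters for the first query: just the top five."""
    return {"q": q, "rows": 5}


def later_query(q, top_ids, page, rows=10):
    """Parameters for any page of the second query, minus the top five."""
    return {
        "q": q,
        "fq": exclude_ids_fq(top_ids),
        "start": page * rows,
        "rows": rows,
    }
```

The client collects the five IDs from the first response and passes them as `top_ids` on every later page.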
wunder
Walter Underwood
wun
.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
in the late 1980s.
https://www.researchgate.net/publication/224734039_On_Packet_Switches_with_Infinite_Storage
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Dec 9, 2019, at 11:14 PM, Mikhail Khludnev wrote:
>
> My experience with "Ou
The best approach is to not use stop words at all. That gives better relevance
with less configuration, so it is a total win.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Dec 2, 2019, at 12:24 PM, Jörn Franke wrote:
>
> You can have
all the Lucene
syntax in queries?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 27, 2019, at 8:37 AM, Mark Miller wrote:
>
> If SolrCloud worked well I’d still agree both options are very valid
> depending on your use case. A
That would be “do-not-overwrite”.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 27, 2019, at 4:38 PM, Walter Underwood wrote:
>
> Even if that works, it is evil as something to leave in a client codebase.
> Maybe a do-no-overwrit
Even if that works, it is evil as something to leave in a client codebase.
Maybe a do-no-overwrite flag would be useful.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 27, 2019, at 3:24 PM, Alexandre Rafalovitch wrote:
>
>
I found the zk uploading stuff to be under-documented. Plus, it requires
installing Solr on the deployment machine.
So I used the Python kazoo package and wrote my own uploader.
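Not the actual uploader, but a minimal sketch of the approach. It assumes configs live under /configs/<name> with no chroot, and that `zk` is a started kazoo KazooClient or anything with the same exists/set/create methods:

```python
import os


def upload_config(zk, local_dir, config_name):
    """Copy every file under local_dir to /configs/<config_name> in ZooKeeper."""
    root = f"/configs/{config_name}"
    uploaded = []
    for dirpath, _dirs, files in os.walk(local_dir):
        for filename in sorted(files):
            path = os.path.join(dirpath, filename)
            relative = os.path.relpath(path, local_dir).replace(os.sep, "/")
            znode = f"{root}/{relative}"
            with open(path, "rb") as handle:
                data = handle.read()
            if zk.exists(znode):
                zk.set(znode, data)  # overwrite an existing file
            else:
                zk.create(znode, data, makepath=True)  # create parents too
            uploaded.append(znode)
    return uploaded


# Usage, assuming kazoo is installed and a hypothetical ZooKeeper address:
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk1:2181")
#   zk.start()
#   upload_config(zk, "conf", "myconfig")
#   zk.stop()
```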
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 19, 2019, at 5
I explain it this way:
* fq: filtering
* q: filtering and scoring
* bq: scoring
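Here are the three side by side in one request, as a sketch. The field names and values are hypothetical, and bq needs the dismax or edismax parser:

```python
from urllib.parse import urlencode

params = {
    "defType": "edismax",          # bq is a dismax/edismax parameter
    "q": "managerial accounting",  # selects matches and scores them
    "fq": "type:book",             # restricts matches, no effect on score
    "bq": "format:hardcover^2",    # adjusts score, does not restrict matches
}

query_string = urlencode(params)
print(query_string)
```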
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 12, 2019, at 9:08 AM, Erik Hatcher wrote:
>
>
>
>> On Nov 12, 2019, at 12:01 PM, rhys J wr
If we had IDF for phrases, they would be super effective. The 2X weight is a
hack that mostly works.
Infoseek had phrase IDF and it was a killer algorithm for relevance.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 8, 2019, at 11:08
A”, and that shows up in a query, that term can be queried
against the field matching that vocabulary.
This is how LinkedIn separates people, companies, and places, for example.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 8, 2019, at 10:48
But when you change it to AND, a single misspelling means zero results. That is
usually not helpful.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 8, 2019, at 10:43 AM, David Hastings
> wrote:
>
> is your default operator
herited that implementation and I am really keen to adequate it, what
> would you recommend ?
>
> Cheers
> Guilherme
>
>> On 7 Nov 2019, at 14:43, Walter Underwood wrote:
>>
>> Thanks for posting the files. Looking at schema.xml, I see that you
handlers, weights of 20, 50, and 100 are extremely high. I
don’t think I’ve ever used a weight higher than 16 in a dozen years of
configuring Solr.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri wrote:
>
. Reindex all of the documents.
When indexed with the new analysis chain, the stopwords will not be removed and
they will be searchable.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri wrote:
>
>
If it is the same document, why are you changing the ID? Use the same ID and
you are done. You won’t need to delete previous versions.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Nov 4, 2019, at 8:37 AM, Khare, Kushal (MIND)
>
the collection gets frequent updates and
is getting limited public traffic. That will change on Monday.
Make sure that you have more free RAM than the size of the index. Allow
for the size of the JVM, OS, etc.
Make sure you have plenty of CPU. After you have the RAM, CPU is the
bottleneck.
wunder
Walter
of as a relevance term.
This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek always
beat Google in relevance tests, probably because of phrase IDF.
More Like This could do the same thing, but it seems to be really slow and
not especially useful as a search component.
wunder
Walter
years ago, I hit several movie or TV
titles which were all stopwords. I wrote about them in this blog post.
https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Oct 9, 2019, at 6
Just set Xms and Xmx the same. The server will be running for weeks,
so allocate the memory and get on with it.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Oct 3, 2019, at 11:38 AM, ndra wrote:
>
>> I don’t think having the initial
to the long-lived space.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Oct 3, 2019, at 10:11 AM, ndra wrote:
>
>> When the heap is out of free space that
>> can be recovered with minor GC, the JVM will increase the size if possible.
>
GC -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"
wunder
Walter Underwood
wun...@wunderwood.org
http://o
Always make Xmx and Xms the same. The heap will increase to the max before a
major GC, so avoid the pauses to grow it.
Use the G1 collector. CMS is really obsolete. We’ve had G1 in prod for at least
three years.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my
while I wasn’t looking) in one request.
I also have a hairy shell script to do /export on each leader after parsing
cluster status. That might be a little large to post to this list, but I can do
it if there is general interest.
wunder
Walter Underwood
wun...@wunderwood.org
http
31G is still a very large heap. We use 8G for all of our different clusters.
Do you have JVM monitoring? Look at the heap used after a major GC. Use that
number, plus some extra, for the heap size.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog
hat behavior.
doc1: glück
doc1 terms: glück, gluck, glueck
doc2: glueck
doc2 terms: glueck
df for glück: 1
df for gluck: 1
df for glueck: 2
The df for the term “glück” is the same whether you expand or not.
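The same counts, computed as a quick check. This is a sketch over the two example documents above, with document frequency taken as the number of documents containing each term:

```python
from collections import Counter


def document_frequency(doc_terms):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in doc_terms.values():
        df.update(set(terms))  # each document counts once per term
    return df


# Index-time expansion applied to doc1, as in the example above.
expanded = {
    "doc1": ["glück", "gluck", "glueck"],
    "doc2": ["glueck"],
}

# No expansion: each document keeps only its original term.
plain = {
    "doc1": ["glück"],
    "doc2": ["glueck"],
}

print(document_frequency(expanded))
print(document_frequency(plain))
```

Expansion raises the df for “glueck” to 2 but leaves the df for “glück” at 1.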
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
am needs the width
of every character in the current font.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
but at least there
is a match.
coöperation
cooperation
cooepoeration (typewriter umlaut version)
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
ulture/culture-desk/the-curse-of-the-diaeresis
In German, there are corner cases where just stripping the umlaut changes one
word into another, like schön/schon.
Isn’t language fun?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Aug 30, 2019, at
First, time fetching one million records with all the fields you need, both for
display and for re-ranking. If that is slow, then no amount of cosine code
tweaking will make it fast.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Aug 16, 2
.
As Kernighan and Plauger said in 1978, "Don’t diddle code to make it
faster—find a better algorithm.”
https://en.wikipedia.org/wiki/The_Elements_of_Programming_Style
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Aug 11, 2019, at 10:40
?action=CLUSTERSTATUS=json; |
jq -r '.cluster.collections[].shards[].replicas[].node_name' | sort -u`
do
echo $host
done
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jul 15, 2019, at 10:55 PM, Fatima Khan wrote:
>
> Hi All,
>I am working
One of our clusters got as large as 40 c4.8xlarge, another is happy with 4
m4.xlarge and could probably handle the load with one of them. It depends on
the number of documents, query load, types of queries, frequency of updates,
all sorts of things.
wunder
Walter Underwood
wun
I haven’t removed stopwords since 1996, when I joined Infoseek. What is your
special case where you must remove them?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jun 22, 2019, at 9:51 PM, akash jayaweera
> wrote:
>
> Hello Walter