Re: Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-02 Thread Michael McCandless
+1 to make this less trappy.

It looks like KoreanPartOfSpeechStopFilterFactory falls back to the default
stop tags if no args are provided.  I think we should indeed make
JapanesePartOfSpeechStopFilterFactory consistent with that.

Maybe we fix this only in the next major release (9.0), add an entry to
MIGRATE.txt explaining the change, and go with option 2?  And possibly option 1
for 8.x releases?  (Or maybe don't fix it in 8.x releases at all... not sure.)
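
For concreteness, a minimal sketch of what option 2 could look like (the
wrapper class and the helper loadStopTagsFromFiles() are illustrative
stand-ins, not the actual factory code; JapaneseAnalyzer.getDefaultStopTags()
is the real API):

import java.util.Map;
import java.util.Set;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;

class StopTagDefaultsSketch {
  // Option 2: if no "tags" argument was given, fall back to the same
  // default stop tags JapaneseAnalyzer uses, instead of silently creating
  // a no-op filter.
  Set<String> resolveStopTags(Map<String, String> args) {
    String tagFiles = args.get("tags");
    if (tagFiles == null) {
      return JapaneseAnalyzer.getDefaultStopTags();
    }
    // Hypothetical stand-in for the factory's existing file-loading path.
    return loadStopTagsFromFiles(tagFiles);
  }

  Set<String> loadStopTagsFromFiles(String tagFiles) {
    throw new UnsupportedOperationException("stand-in for existing behavior");
  }
}

Option 1 would instead throw an IllegalArgumentException at the same
decision point.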

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 2, 2020 at 12:10 PM Michael Froh  wrote:

> I am currently working on migrating a project from an old version of Solr
> to Elasticsearch, and came across a funny (to me at least) difference in
> the "default" behavior of JapanesePartOfSpeechStopFilterFactory.
>
> If JapanesePartOfSpeechStopFilterFactory is given empty args, it does
> nothing. It doesn't load any stop tags, and just passes along the
> TokenStream passed to create(). (By comparison, the Elasticsearch filter
> will default to loading the stop tags shipped in the Kuromoji analyzer
> JAR.) So, for many years, my project was not using
> JapanesePartOfSpeechStopFilter, when I thought that it was.
>
> I would like to create an issue and submit a patch, in case other users
> out there are failing to use the filter factory correctly, but I'm not sure
> what the best approach is, between:
>
> 1. If someone doesn't specify the tags argument, then throw an exception
> (because the user probably doesn't know what they're doing).
> 2. If someone doesn't specify the tags argument, then load the default
> stop tags (like JapaneseAnalyzer does).
>
> I would lean more toward 1, to avoid a silent change in behavior.
>


Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-02 Thread Michael Froh
I am currently working on migrating a project from an old version of Solr
to Elasticsearch, and came across a funny (to me at least) difference in
the "default" behavior of JapanesePartOfSpeechStopFilterFactory.

If JapanesePartOfSpeechStopFilterFactory is given empty args, it does
nothing. It doesn't load any stop tags, and just passes along the
TokenStream passed to create(). (By comparison, the Elasticsearch filter
will default to loading the stop tags shipped in the Kuromoji analyzer
JAR.) So, for many years, my project was not using
JapanesePartOfSpeechStopFilter, when I thought that it was.

I would like to create an issue and submit a patch, in case other users out
there are failing to use the filter factory correctly, but I'm not sure
what the best approach is, between:

1. If someone doesn't specify the tags argument, then throw an exception
(because the user probably doesn't know what they're doing).
2. If someone doesn't specify the tags argument, then load the default stop
tags (like JapaneseAnalyzer does).

I would lean more toward 1, to avoid a silent change in behavior.
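
To make the trap concrete, a small sketch of the no-op behavior described
above (the wrapper class is hypothetical; as far as I can tell, the factory
is constructed with an args map and create() wraps a TokenStream):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilterFactory;

class NoArgsTrapSketch {
  // someTokenStream stands for whatever stream the analysis chain hands in.
  TokenStream demonstrate(TokenStream someTokenStream) {
    Map<String, String> args = new HashMap<>();  // no "tags" argument
    JapanesePartOfSpeechStopFilterFactory factory =
        new JapanesePartOfSpeechStopFilterFactory(args);
    // No stop tags were loaded, so create() returns its input unchanged:
    TokenStream filtered = factory.create(someTokenStream);
    // filtered == someTokenStream -- the "filter" silently does nothing.
    return filtered;
  }
}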


Re: Highlight with Proximity search throws an exception

2020-10-02 Thread Michael McCandless
Hi Juraj+,

This indeed smells like a bug.  FuzzyTermsEnum should never try to set a
negative boost!

Could you open an issue and submit a PR (or attach a patch) with your test
case?  Thank you for boiling this down.  This part really made me chuckle:

> When our text contains an apostrophe followed by a single character AND
> our search query is composed of exactly two letters followed by a
> proximity search AND we use highlighting, we get an exception:

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 1, 2020 at 12:48 PM Michael Sokolov  wrote:

> I traced this to this block in FuzzyTermsEnum:
>
> if (ed == 0) { // exact match
>   boostAtt.setBoost(1.0F);
> } else {
>   final int codePointCount = UnicodeUtil.codePointCount(term);
>   int minTermLength = Math.min(codePointCount, termLength);
>
>   float similarity = 1.0f - (float) ed / (float) minTermLength;
>   boostAtt.setBoost(similarity);
> }
>
> where in your test ed (edit distance) was 2 and minTermLength was 1,
> leading to a negative boost.
>
> I don't really understand this code at all, but I wonder if it should
> divide by maxTermLength instead of minTermLength?
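
Plugging in the numbers from the failing test below (assuming I am reading
the trace right: SimpleAnalyzer splits "doesn't" into "doesn" and "t", so
the fuzzy enum ends up comparing the query term "se" against the
one-character indexed term "t"):

class BoostArithmetic {
  public static void main(String[] args) {
    int ed = 2;              // edit distance between "se" and "t"
    int codePointCount = 1;  // length of the indexed term "t"
    int termLength = 2;      // length of the query term "se"
    int minTermLength = Math.min(codePointCount, termLength);      // = 1
    float similarity = 1.0f - (float) ed / (float) minTermLength;  // = -1.0f
    System.out.println(similarity);  // -1.0, the boost that gets rejected
    // Dividing by Math.max(codePointCount, termLength) would instead give
    // 1.0f - 2.0f / 2.0f = 0.0f -- not negative, though whether a zero
    // boost is acceptable is a separate question.
  }
}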
>
> On Thu, Oct 1, 2020 at 9:54 AM Juraj Jurčo  wrote:
> >
> > Hi guys,
> > we are trying to implement search and we have experienced a strange
> > situation. When our text contains an apostrophe followed by a single
> > character AND our search query is composed of exactly two letters
> > followed by a proximity search AND we use highlighting, we get an
> > exception:
> >
> >> java.lang.IllegalArgumentException: boost must be a positive float, got -1.0
> >
> > It seems there is a problem at FuzzyTermsEnum.java:271 (float similarity
> > = 1.0f - (float) ed / (float) minTermLength) when it reaches that line
> > with ed=2 and sets a negative boost.
> >
> > I was able to reproduce the error with the following code:
> >
> > import java.io.IOException;
> > import java.nio.file.Path;
> >
> > import org.apache.commons.io.FileUtils;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.core.SimpleAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.document.TextField;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.index.IndexWriterConfig;
> > import org.apache.lucene.queryparser.classic.ParseException;
> > import org.apache.lucene.queryparser.classic.QueryParser;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.search.highlight.Highlighter;
> > import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
> > import org.apache.lucene.search.highlight.QueryScorer;
> > import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
> > import org.apache.lucene.search.highlight.TokenSources;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.FSDirectory;
> > import org.junit.jupiter.api.Test;
> >
> > class FindSqlHighlightTest {
> >
> >    @Test
> >    void reproduceHighlightProblem() throws IOException, ParseException, InvalidTokenOffsetsException {
> >       String text = "doesn't";
> >       String field = "text";
> >       // NOK: se~, se~2 and any higher number
> >       // OK: sel~, s~, se~1
> >       String uQuery = "se~";
> >       int maxStartOffset = -1;
> >       Analyzer analyzer = new SimpleAnalyzer();
> >
> >       Path indexLocation = Path.of("temp", "reproduceHighlightProblem").toAbsolutePath();
> >       if (indexLocation.toFile().exists()) {
> >          FileUtils.deleteDirectory(indexLocation.toFile());
> >       }
> >       Directory indexDir = FSDirectory.open(indexLocation);
> >
> >       // Create index
> >       IndexWriterConfig dimsIndexWriterConfig = new IndexWriterConfig(analyzer);
> >       dimsIndexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
> >       IndexWriter idxWriter = new IndexWriter(indexDir, dimsIndexWriterConfig);
> >       // add doc
> >       Document doc = new Document();
> >       doc.add(new TextField(field, text, Field.Store.NO));
> >       idxWriter.addDocument(doc);
> >       // commit
> >       idxWriter.commit();
> >       idxWriter.close();
> >
> >       // search & highlight
> >       Query query = new QueryParser(field, analyzer).parse(uQuery);
> >       Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
> >       TokenStream tokenStream = TokenSources.getTokenStream(field, null, text, analyzer, maxStartOffset);
> >       String highlighted = highlighter.getBestFragment(tokenStream, text);
> >       System.out.println(highlighted);
> >    }
> > }
> >
> >
> > Could you please confirm whether it's a bug in Lucene or whether we are
> > doing something that is not allowed?
> >
> > Thanks a lot!
> > Best,
> > Juraj+

Re: Should ChildDocTransformerFactory's limit be local or global for deep-nested documents?

2020-10-02 Thread David Smiley
I think that's a bug!  Good catch!

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Oct 1, 2020 at 11:38 PM Alexandre Rafalovitch  wrote:

> I am indexing a deeply nested structure and am trying to return it
> with fl=*,[child].
>
> The result is supposed to have 5 children under the top element, but it
> returns only 4. Two hours of debugging later, I realized that the
> "limit" parameter is set to 10 by default and that the 10 seems to count
> children at ANY level, tallying them depth-first. That made it quite
> hard to see why the children suddenly stopped showing up (a toy model of
> this is sketched after this message).
>
> The documentation says:
> > The maximum number of child documents to be returned per parent
> > document. The default is `10`.
>
> So, is that (all nested children counted against the limit) what we
> actually mean? Or did we mean the maximum number of "immediate children"
> for any specific document/level, in which case the code is wrong?
>
> I can update the doc to clarify the results, but I don't know whether
> I am looking at the bug or the feature.
>
> Regards,
>Alex.
>
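
To make the reported behavior concrete, here is a toy model (plain Java,
not Solr's actual ChildDocTransformer code; the traversal shape is assumed
from the observation above) of one global limit consumed depth-first:

import java.util.ArrayList;
import java.util.List;

public class GlobalChildLimitSketch {
  static class DocNode {
    final String id;
    final List<DocNode> children;
    DocNode(String id, List<DocNode> children) {
      this.id = id;
      this.children = children;
    }
  }

  static final int LIMIT = 10;  // the documented default
  static int emitted = 0;

  static void collect(DocNode doc, List<String> out) {
    for (DocNode child : doc.children) {
      if (emitted >= LIMIT) return;  // budget is shared across ALL levels
      out.add(child.id);
      emitted++;
      collect(child, out);           // depth-first: descendants consume it too
    }
  }

  public static void main(String[] args) {
    // 5 top-level children, each with 2 grandchildren (15 descendants total).
    List<DocNode> kids = new ArrayList<>();
    for (int i = 1; i <= 5; i++) {
      List<DocNode> grandKids = new ArrayList<>();
      grandKids.add(new DocNode("child" + i + ".1", new ArrayList<>()));
      grandKids.add(new DocNode("child" + i + ".2", new ArrayList<>()));
      kids.add(new DocNode("child" + i, grandKids));
    }
    DocNode parent = new DocNode("parent", kids);
    List<String> out = new ArrayList<>();
    collect(parent, out);
    // Prints 10 ids ending at "child4": only 4 of the 5 top-level children
    // show up, because earlier grandchildren already consumed the budget.
    System.out.println(out);
  }
}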


JDK 16 EA build 18 is now available

2020-10-02 Thread Rory O'Donnell

Hi Uwe & Dawid,

OpenJDK 16 Early Access build 18 is now available at http://jdk.java.net/16

 * These early-access, open-source builds are provided under the GNU
   General Public License, version 2, with the Classpath Exception.

 * Features:
     o JEPs proposed to target JDK 16:
         + JEP 376: ZGC: Concurrent Thread-Stack Processing
         + JEP 386: Alpine Linux Port
         + JEP 388: Windows/AArch64 Port
     o JEPs targeted to JDK 16, so far:
         + JEP 338: Vector API (Incubator)
         + JEP 347: Enable C++14 Language Features
         + JEP 357: Migrate from Mercurial to Git
         + JEP 369: Migrate to GitHub
         + JEP 387: Elastic Metaspace

 * Release Notes are available at http://jdk.java.net/16/release-notes


 * Changes in recent builds that may be of interest:
     o Build 18
         + JDK-8235710: Removal of Legacy Elliptic Curves
         + JDK-8245527: LDAP Channel Binding support for Java GSS/Kerberos
         + JDK-8252739: Deflater.setDictionary(byte[], int off, int len)
           ignores the starting offset for the dictionary
             # Reported by Apache Lucene
     o Build 17
         + JDK-8247281: Object monitors no longer keep strong references
           to their associated object
         + JDK-8202473: A type variable with multiple bounds does not
           correctly place type annotation
             # Reported by ByteBuddy
         + JDK-8234808: jdb quoted option parsing broken
             # Reported by Apache Tomcat
     o Build 16
         + JDK-8172366: SUN, SunRsaSign, and SunEC Providers Supports
           SHA-3 Based Signature Algorithms
         + JDK-8244706: GZIPOutputStream now sets the GZIP OS Header
           Field to the correct default value

 * The Quality Report for September 2020 was published here [1]. Thanks to
   everyone who contributed by creating features or enhancements, logging
   bugs, or downloading and testing the early-access builds.


Rgds, Rory

[1] https://wiki.openjdk.java.net/display/quality/Quality+Outreach+report+September+2020


--
Rgds, Rory O'Donnell
Quality Engineering Manager
Oracle EMEA, Dublin, Ireland