OyvindLGjesdal commented on issue #1581:
URL: https://github.com/apache/jena/issues/1581#issuecomment-1296186174

   I think the current documentation implies that Jena follows the Lucene 
behavior, since it mentions multiple times that Lucene's StandardAnalyzer is 
used (and, implicitly, its behavior?):
   
   >  The default analyzer defaults to Lucene’s StandardAnalyzer.
   
   >  If a Lucene or Elasticsearch text index is used, then by default the 
Lucene StandardAnalyzer is used.
   
   > The multilingual analyzer becomes the default analyzer and the Lucene 
StandardAnalyzer is the default analyzer used when there is no language tag.
   
   Maybe a note could be added to the documentation:
   
   **Note** From Lucene version 9, English stop words are no longer removed by 
default in StandardAnalyzer. This also changes the default behavior for Jena 
4.X. You can keep the old behavior by configuring a custom analyzer in the 
assembler. (link to custom analyzer or source code of assembler containing list 
of English stop words?)
   
   (List from 
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L48)
   ```
   ("a" "an" "and" "are" "as" "at" "be" "but" "by" "for" "if" "in" 
    "into" "is" "it" "no" "not" "of" "on" "or" "such" "that" "the" 
   "their" "then" "there" "these" "they" "this" "to" "was" "will" "with")  
   ```
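   
   As a rough sketch of what such a custom analyzer entry might look like in 
the assembler (assuming the `text:` prefix is bound to 
`http://jena.apache.org/text#`, that `text:StandardAnalyzer` accepts a 
`text:stopWords` list as the text-index documentation describes, and with 
`<#entMap>`, the field name `"label"`, and `rdfs:label` as placeholder names):
   
   ```
   # Hypothetical entity map; only the text:analyzer part matters here.
   <#entMap> a text:EntityMap ;
       text:entityField  "uri" ;
       text:defaultField "label" ;
       text:map (
           [ text:field "label" ;
             text:predicate rdfs:label ;
             # StandardAnalyzer with the old English stop word list restored
             text:analyzer [
                 a text:StandardAnalyzer ;
                 text:stopWords ("a" "an" "and" "are" "as" "at" "be" "but" "by"
                                 "for" "if" "in" "into" "is" "it" "no" "not"
                                 "of" "on" "or" "such" "that" "the" "their"
                                 "then" "there" "these" "they" "this" "to"
                                 "was" "will" "with")
             ]
           ]
       ) .
   ```
   
   This is untested against Jena 4.X, so the exact property names should be 
checked against the assembler source before it goes into the docs.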


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

