[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------
    Fix Version/s:     (was: 3.6)

Thanks, Robert, for the patches and comments.

{quote}
The only option for 3.6 would be something like my previous patch (https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch) which has the disadvantages of doing the second-phase re-ranking on surface forms.
{quote}

With those disadvantages, the spell checker won't work well for Japanese anyway, so I am giving up on this for 3.6.

> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 4.0
>         Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch
>
> The "did you mean?" feature built on Lucene's spell checker unfortunately does not work well for Japanese, and this is a longstanding problem. The logic needs comparatively long text to check spelling, but in some languages (e.g. Japanese) most words are too short for the spell checker to use.
> I think that, at least for Japanese, things can be improved if we split off the spell check word and the surface form in the spell check dictionary. Then we can use ReadingAttribute for spell checking but CharTermAttribute for suggesting, for example.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
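The idea in the LUCENE-3888 description (check spelling against the reading, suggest the surface form) can be illustrated with a minimal sketch. Everything here is hypothetical: `ReadingSurfaceDictionary` is not a Lucene class, the readings are romanized by hand, and plain Levenshtein distance stands in for whatever matching the real spell checker performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the Lucene spellchecker API. */
final class ReadingSurfaceDictionary {
    // reading (the word that is spell checked) -> surface forms (what is suggested)
    private final Map<String, List<String>> surfacesByReading = new LinkedHashMap<>();

    void add(String reading, String surface) {
        surfacesByReading.computeIfAbsent(reading, k -> new ArrayList<>()).add(surface);
    }

    /** Returns the surface forms whose reading is within maxEdits of the given reading. */
    List<String> suggest(String misspelledReading, int maxEdits) {
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : surfacesByReading.entrySet()) {
            if (editDistance(misspelledReading, e.getKey()) <= maxEdits) {
                suggestions.addAll(e.getValue());
            }
        }
        return suggestions;
    }

    /** Classic two-row Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The point of the split is that a two-character surface form gives a distance measure almost nothing to work with, while its reading is usually several characters long.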
[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------
    Attachment: LUCENE-3888.patch

I added a test for the surface analyzer. I also added code for the analyzer in Solr. Currently, due to a classpath problem, the test cannot be compiled. I should dig in, but if someone could take a look, it would be appreciated.

> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>         Attachments: LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch, LUCENE-3888.patch
[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------
    Attachment: LUCENE-3888.patch

The patch does not compile yet because I changed the return type of the method in the Dictionary interface but have not yet updated the implementing classes. Please comment, because I'm new to the spell checker. If there is no problem with this direction, I'll continue working on it.

> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>         Attachments: LUCENE-3888.patch
[jira] [Updated] (SOLR-2418) remove deprecated <highlighting/> syntax
[ https://issues.apache.org/jira/browse/SOLR-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2418:
---------------------------------
    Fix Version/s:     (was: 3.6)

> remove deprecated <highlighting/> syntax
> ----------------------------------------
>
>                 Key: SOLR-2418
>                 URL: https://issues.apache.org/jira/browse/SOLR-2418
>             Project: Solr
>          Issue Type: Task
>          Components: highlighter
>    Affects Versions: 1.4.1, 3.1
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Trivial
>             Fix For: 4.0
>
> excerpt from CHANGES.txt:
> {noformat}
> == 3.1.0-dev ==
> snip
> Upgrading from Solr 1.4
> --
> snip
> * Old syntax of highlighting configuration in solrconfig.xml
>   is deprecated (SOLR-1696)
> {noformat}
[jira] [Updated] (SOLR-2202) Money/Currency FieldType
[ https://issues.apache.org/jira/browse/SOLR-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2202:
---------------------------------
    Attachment: SOLR-2202-fix-NPE-if-no-tlong-fieldType.patch

A draft patch which hasn't been tested yet.

> Money/Currency FieldType
> ------------------------
>
>                 Key: SOLR-2202
>                 URL: https://issues.apache.org/jira/browse/SOLR-2202
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 1.5
>            Reporter: Greg Fodor
>            Assignee: Jan Høydahl
>             Fix For: 3.6, 4.0
>         Attachments: SOLR-2022-solr-3.patch, SOLR-2202-fix-NPE-if-no-tlong-fieldType.patch, SOLR-2202-lucene-1.patch, SOLR-2202-solr-1.patch, SOLR-2202-solr-10.patch, SOLR-2202-solr-2.patch, SOLR-2202-solr-4.patch, SOLR-2202-solr-5.patch, SOLR-2202-solr-6.patch, SOLR-2202-solr-7.patch, SOLR-2202-solr-8.patch, SOLR-2202-solr-9.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch, SOLR-2202.patch
>
> Provides support for monetary values to Solr/Lucene with query-time currency conversion. The following features are supported:
> - Point queries
> - Range queries
> - Sorting
> - Currency parsing by either currency code or symbol.
> - Symmetric & Asymmetric exchange rates. (Asymmetric exchange rates are useful if there are fees associated with exchanging the currency.)
> At indexing time, money fields can be indexed in a native currency. For example, if a product on an e-commerce site is listed in Euros, indexing the price field as 1000,EUR will index it appropriately. By altering the currency.xml file, the sorting and querying against Solr can take into account fluctuations in currency exchange rates without having to re-index the documents.
> The new money field type is a polyfield which indexes two fields, one which contains the amount of the value and another which contains the currency code or symbol. The currency metadata (names, symbols, codes, and exchange rates) is expected to be in an XML file which is pointed to by the field type declaration in the schema.xml.
> The current patch is factored such that Money utility functions and configuration metadata lie in Lucene (see MoneyUtil and CurrencyConfig), while the MoneyType and MoneyValueSource lie in Solr. This was meant to mirror the work being done on the spatial field types.
> This patch will be used to power the international search capabilities of the search engine at Etsy.
> Also see the wiki page: http://wiki.apache.org/solr/MoneyFieldType
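To make the symmetric versus asymmetric exchange-rate distinction concrete, here is a minimal sketch of query-time conversion. It is not taken from the SOLR-2202 patches: the class `ExchangeRates`, its methods, and the use of integer minor units are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of query-time currency conversion; not the patch's MoneyUtil. */
final class ExchangeRates {
    // key is "FROM>TO", value is the multiplier applied to an amount in FROM
    private final Map<String, Double> rates = new HashMap<>();

    void setRate(String from, String to, double rate) {
        rates.put(from + ">" + to, rate);
    }

    /**
     * Converts an amount held in minor units (e.g. cents).
     * If an explicit rate exists for the requested direction it is used
     * (asymmetric case); otherwise the inverse of the reverse rate is
     * used (symmetric case).
     */
    long convert(long amount, String from, String to) {
        if (from.equals(to)) {
            return amount;
        }
        Double direct = rates.get(from + ">" + to);
        if (direct != null) {
            return Math.round(amount * direct);
        }
        Double reverse = rates.get(to + ">" + from);
        if (reverse != null) {
            return Math.round(amount / reverse);
        }
        throw new IllegalArgumentException("no rate for " + from + " -> " + to);
    }
}
```

Because the documents keep their native amount and currency, changing a rate in this table changes query and sort results without touching the index, which is the property the description attributes to editing currency.xml.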
[jira] [Updated] (SOLR-2909) Add support for ResourceLoaderAware tokenizerFactories in synonym filter factories
[ https://issues.apache.org/jira/browse/SOLR-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2909:
---------------------------------
    Attachment: SOLR-2909.patch

> Add support for ResourceLoaderAware tokenizerFactories in synonym filter factories
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-2909
>                 URL: https://issues.apache.org/jira/browse/SOLR-2909
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tom Klonikowski
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-2909.patch
>
> The optional custom tokenizerFactory in SlowSynonymFilterFactory and FSTSynonymFilterFactory might require the ResourceLoader information. Thus inform(ResourceLoader) should be called if the specified tokenizerFactory is an instance of ResourceLoaderAware.
> {noformat}
> private static TokenizerFactory loadTokenizerFactory(ResourceLoader loader, String cname, Map<String, String> args) {
>   TokenizerFactory tokFactory = (TokenizerFactory) loader.newInstance(cname);
>   tokFactory.init(args);
>   if (tokFactory instanceof ResourceLoaderAware) {
>     ((ResourceLoaderAware) tokFactory).inform(loader);
>   }
>   return tokFactory;
> }
> {noformat}
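The init-then-inform pattern proposed in SOLR-2909 can be tried outside Solr with stand-in types. The sketch below is an assumption-laden illustration: `SketchResourceLoader`, `SketchLoaderAware`, `SketchTokenizerFactory` and the two factory classes are hypothetical substitutes for the Solr types, and the factory is passed in already instantiated rather than created by class name.

```java
import java.util.Map;

/** Hypothetical stand-in for Solr's ResourceLoader. */
interface SketchResourceLoader {
    String readResource(String name);
}

/** Hypothetical stand-in for ResourceLoaderAware. */
interface SketchLoaderAware {
    void inform(SketchResourceLoader loader);
}

/** Hypothetical stand-in for TokenizerFactory. */
abstract class SketchTokenizerFactory {
    Map<String, String> args;

    void init(Map<String, String> args) {
        this.args = args;
    }
}

/** A factory that needs a resource, so it implements the aware interface. */
final class DictTokenizerFactory extends SketchTokenizerFactory implements SketchLoaderAware {
    String dictionary;

    @Override
    public void inform(SketchResourceLoader loader) {
        // relies on init(args) having run first
        dictionary = loader.readResource(args.get("dict"));
    }
}

/** A factory that needs no resources. */
final class PlainTokenizerFactory extends SketchTokenizerFactory {
}

final class FactoryLoading {
    /** Runs init(args), then inform(loader) only when the factory asks for it. */
    static <T extends SketchTokenizerFactory> T configure(
            T factory, Map<String, String> args, SketchResourceLoader loader) {
        factory.init(args);
        if (factory instanceof SketchLoaderAware) {
            ((SketchLoaderAware) factory).inform(loader);
        }
        return factory;
    }
}
```

The ordering matters: `inform` typically reads file names out of the init arguments, so calling it before `init` would fail.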
[jira] [Updated] (SOLR-3055) Use NGramPhraseQuery in Solr
[ https://issues.apache.org/jira/browse/SOLR-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-3055:
---------------------------------
    Attachment: SOLR-3055.patch

How about introducing something like GramSizeAttribute? I attached a draft-level patch, just as an idea.

> Use NGramPhraseQuery in Solr
> ----------------------------
>
>                 Key: SOLR-3055
>                 URL: https://issues.apache.org/jira/browse/SOLR-3055
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis, search
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-3055.patch
>
> Solr should use NGramPhraseQuery when searching with the default slop on an n-gram field.
[jira] [Updated] (LUCENE-3698) FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text
[ https://issues.apache.org/jira/browse/LUCENE-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3698:
-----------------------------------
    Attachment: LUCENE-3698.patch

Updated patch. Since I think this fix changes runtime behavior, I added an entry to CHANGES.txt. I also changed Field.isTokenized() to Field.fieldType().tokenized(), because the former does not compile on trunk.

> FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3698
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3698
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Shay Banon
>         Attachments: LUCENE-3698.patch, LUCENE-3698.patch
>
> The FVH adds an additional ' ' (the multi value separator) to the end of the highlighted text.
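The trailing-separator symptom reported in LUCENE-3698 is the usual result of appending a separator after every value instead of between values. The sketch below only illustrates that general technique; `MultiValueJoin` is a hypothetical helper, not the FastVectorHighlighter code or the attached patch.

```java
/** Hypothetical helper; shows separator-between-values joining only. */
final class MultiValueJoin {
    /** Joins field values with the separator between them and nothing after the last one. */
    static String join(String[] values, char separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(separator); // only before the 2nd, 3rd, ... value
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }
}
```

With a single-valued field this returns the value unchanged, which is the case where a stray trailing space is most visible in highlighted output.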
[jira] [Updated] (LUCENE-3698) FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text
[ https://issues.apache.org/jira/browse/LUCENE-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3698:
-----------------------------------
         Priority: Minor  (was: Major)
    Fix Version/s: 4.0
                   3.6
         Assignee: Koji Sekiguchi

> FastVectorHighlighter adds a multi value separator (space) to the end of the highlighted text
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3698
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3698
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Shay Banon
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>         Attachments: LUCENE-3698.patch, LUCENE-3698.patch
[jira] [Updated] (LUCENE-3697) FastVectorHighlighter SimpleBoundaryScanner does not work well when highlighting at the beginning of the text
[ https://issues.apache.org/jira/browse/LUCENE-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3697:
-----------------------------------
    Attachment: LUCENE-3697.patch

A patch that just fixes the expected string in #testMVSeparator.

> FastVectorHighlighter SimpleBoundaryScanner does not work well when highlighting at the beginning of the text
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3697
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Shay Banon
>         Attachments: LUCENE-3697.patch, LUCENE-3697.patch
>
> The SimpleBoundaryScanner still breaks text at a position that is not one of the provided boundary characters when the backward scan ends up reaching the beginning of the text to highlight. In this case, just use the start of the text as the offset.
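The behavior requested in LUCENE-3697 (treat the start of the text as a valid boundary when the backward scan reaches it) can be sketched as follows. This is an illustration written for this archive, not the attached patch: `BoundaryScanSketch` and its parameter names are assumptions, although the method name echoes the scanner's `findStartOffset`.

```java
import java.util.Set;

/** Hypothetical sketch of a backward boundary scan; not Lucene's SimpleBoundaryScanner. */
final class BoundaryScanSketch {
    /**
     * Scans backward from start for a boundary character, looking at most
     * maxScan characters. Returns the offset just after the boundary, 0 when
     * the scan reaches the beginning of the text, or start unchanged when the
     * scan budget runs out first.
     */
    static int findStartOffset(CharSequence text, int start, int maxScan, Set<Character> boundaryChars) {
        if (start > text.length() || start < 1) {
            return start;
        }
        int offset = start;
        int remaining = maxScan;
        while (offset > 0 && remaining > 0) {
            if (boundaryChars.contains(text.charAt(offset - 1))) {
                return offset;
            }
            offset--;
            remaining--;
        }
        if (offset == 0) {
            return 0; // beginning of text counts as a boundary
        }
        return start; // nothing found within maxScan
    }
}
```

Without the `offset == 0` case the method would fall back to `start`, cutting the fragment in the middle of a word even though the whole beginning of the text was available.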
[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()
[ https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-3012:
---------------------------------
    Fix Version/s: 4.0
                   3.6

> SimplePostTool: move getProperty("type") out of postData()
> ----------------------------------------------------------
>
>                 Key: SOLR-3012
>                 URL: https://issues.apache.org/jira/browse/SOLR-3012
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Trivial
>             Fix For: 3.6, 4.0
>         Attachments: SOLR-3012.patch
>
> Applications that use SimplePostTool can now set the Content-type, but only through the "type" system property.
> {code}
> public void postData(InputStream data, Integer length, OutputStream output) {
>   final String type = System.getProperty("type", DEFAULT_DATA_TYPE);
>   :
> }
> {code}
> If the getProperty() call is moved to main() and the type can be set via an argument of the method, client applications become more flexible.
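The refactoring described in SOLR-3012 amounts to reading the system property once at the entry point and passing the value down as a parameter. A minimal sketch under stated assumptions: `PostToolSketch` is a hypothetical class, the default value `application/xml` is assumed, and the "request" is simplified to a header line followed by the body so the example runs without a server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical sketch; not the actual SimplePostTool. */
final class PostToolSketch {
    static final String DEFAULT_DATA_TYPE = "application/xml"; // assumed default

    /** What main() would do: consult the "type" property once, up front. */
    static String resolveType(Properties props) {
        return props.getProperty("type", DEFAULT_DATA_TYPE);
    }

    /**
     * Takes the content type as an argument instead of reading a system
     * property. The length parameter is kept only to mirror the quoted
     * signature; this simplified version does not use it.
     */
    static void postData(InputStream data, Integer length, OutputStream output, String type) {
        try {
            output.write(("Content-type: " + type + "\n").getBytes(StandardCharsets.UTF_8));
            byte[] buf = new byte[1024];
            int n;
            while ((n = data.read(buf)) != -1) {
                output.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A command-line entry point would call `resolveType(System.getProperties())` and forward the result, while a client application embedding the tool can pass any type it likes without mutating global JVM state.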
[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()
[ https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-3012:
---------------------------------
    Attachment: SOLR-3012.patch

New patch. Since clients of SimplePostTool will need to be recompiled because of this change, I added a change note to this patch. I'll commit tonight if nobody objects.

> SimplePostTool: move getProperty("type") out of postData()
> ----------------------------------------------------------
>
>                 Key: SOLR-3012
>                 URL: https://issues.apache.org/jira/browse/SOLR-3012
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Trivial
>             Fix For: 3.6, 4.0
>         Attachments: SOLR-3012.patch, SOLR-3012.patch
[jira] [Updated] (SOLR-3012) SimplePostTool: move getProperty(type) out of postData()
[ https://issues.apache.org/jira/browse/SOLR-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-3012:
---------------------------------
    Attachment: SOLR-3012.patch

I also changed the doGet methods to static in this patch.

> SimplePostTool: move getProperty("type") out of postData()
> ----------------------------------------------------------
>
>                 Key: SOLR-3012
>                 URL: https://issues.apache.org/jira/browse/SOLR-3012
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Trivial
>         Attachments: SOLR-3012.patch
[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2346:
---------------------------------
    Attachment: SOLR-2346.patch

New patch attached. I updated it for current trunk and changed the getCharsetFromContentType() method to remove unnecessary strings after the charset value. I think this is ready to go.

> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2346
>                 URL: https://issues.apache.org/jira/browse/SOLR-2346
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1, 3.1, 4.0
>         Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale.
>            Reporter: Prasad Deshpande
>            Assignee: Koji Sekiguchi
>            Priority: Critical
>             Fix For: 3.6, 4.0
>         Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
> I am able to successfully index and search non-English files (like Hebrew, Japanese) which were encoded in UTF-8. However, when I tried to index data which was encoded in a local encoding like Big5 for Japanese, I could not see the desired results. The contents of the Big5-encoded document looked garbled when I searched for all indexed documents. When I index the attached non-UTF-8 file, it is indexed in the following way:
>
> <result name="response" numFound="1" start="0">
>   <doc>
>     <arr name="attr_content"><str>�� ��</str></arr>
>     <arr name="attr_content_encoding"><str>Big5</str></arr>
>     <arr name="attr_content_language"><str>zh</str></arr>
>     <arr name="attr_language"><str>zh</str></arr>
>     <arr name="attr_stream_size"><str>17</str></arr>
>     <arr name="content_type"><str>text/plain</str></arr>
>     <str name="id">doc2</str>
>   </doc>
> </result>
> </response>
>
> You said that the file is indexed as UTF-8; however, it seems that the non-UTF-8 file gets indexed in Big5 encoding. I tried fetching the indexed data as Big5 and converting it to UTF-8:
>
> String id = (String) resulDocument.getFirstValue("attr_content");
> byte[] bytearray = id.getBytes("Big5");
> String utf8String = new String(bytearray, "UTF-8");
>
> It does not give the expected results. When I index a UTF-8 file, it is indexed as follows:
>
> <doc>
>   <arr name="attr_content"><str>マイ ネットワーク</str></arr>
>   <arr name="attr_content_encoding"><str>UTF-8</str></arr>
>   <arr name="attr_stream_content_type"><str>text/plain</str></arr>
>   <arr name="attr_stream_name"><str>sample_jap_unicode.txt</str></arr>
>   <arr name="attr_stream_size"><str>28</str></arr>
>   <arr name="attr_stream_source_info"><str>myfile</str></arr>
>   <arr name="content_type"><str>text/plain</str></arr>
>   <str name="id">doc2</str>
> </doc>
>
> So, I can index and search UTF-8 data. For reference, below is the discussion with Yonik.
>
> Please find attached the TXT file which I was using to index and search.
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F "myfile=@sample_jap_non_UTF-8"
>
> One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8. Here's one way to actually tell solr what the encoding of the text you are sending is:
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
>
> Now the problem appears that for some reason, this doesn't work...
> Could you open a JIRA issue and attach your two test files?
>
> -Yonik
> http://lucidimagination.com
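The `getBytes(...)` / `new String(...)` round trip quoted in the SOLR-2346 report cannot repair the text, because a Java `String` is already decoded: once bytes have been decoded with the wrong charset, the original bytes are generally lost. The sketch below illustrates this with hypothetical helper names; it uses UTF-16BE and UTF-8 instead of Big5 only so that it runs on any JVM, since those two charsets are guaranteed to be available.

```java
import java.nio.charset.Charset;

/** Hypothetical helpers illustrating where charset decoding has to happen. */
final class CharsetDecodeSketch {
    /** Correct place to apply the charset: when turning the raw bytes into text. */
    static String decode(byte[] raw, Charset declared) {
        return new String(raw, declared);
    }

    /** The pattern from the report: re-encode an already decoded string, then decode again. */
    static String roundTrip(String alreadyDecoded, Charset encodeAs, Charset decodeAs) {
        return new String(alreadyDecoded.getBytes(encodeAs), decodeAs);
    }
}
```

In terms of the report, the fix has to happen on the indexing side, by telling Solr the real charset of the posted bytes (as the second curl command attempts with its Content-type header), not by post-processing the stored field value.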
[jira] [Updated] (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2346:
---------------------------------
    Attachment: SOLR-2346.patch

bq. getCharsetFromContentType() method to remove unnecessary strings after the charset value.

My fault. This is not necessary. I should have added the --data-binary option to curl.

> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2346
>                 URL: https://issues.apache.org/jira/browse/SOLR-2346
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1, 3.1, 4.0
>         Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale.
>            Reporter: Prasad Deshpande
>            Assignee: Koji Sekiguchi
>            Priority: Critical
>             Fix For: 3.6, 4.0
>         Attachments: NormalSave.msg, SOLR-2346.patch, SOLR-2346.patch, SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
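For readers unfamiliar with what a method like getCharsetFromContentType() has to do, here is a minimal sketch of extracting the charset parameter from a Content-Type header value. It is not Solr's implementation: `ContentTypeCharset.charsetOf` is a hypothetical helper written for this archive.

```java
/** Hypothetical helper; not Solr's getCharsetFromContentType(). */
final class ContentTypeCharset {
    private static final String KEY = "charset=";

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parameters are separated by ';', e.g. "text/plain; charset=big5"
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                String value = p.substring(KEY.length()).trim();
                // strip optional double quotes around the value
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

Splitting on `;` first is what keeps any later parameters out of the returned charset name.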
[jira] [Updated] (LUCENE-3640) remove IndexSearcher.close
[ https://issues.apache.org/jira/browse/LUCENE-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3640:
-----------------------------------
    Fix Version/s:     (was: 3.6)

Removed the 3.6 tag from Fix Version/s.

> remove IndexSearcher.close
> --------------------------
>
>                 Key: LUCENE-3640
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3640
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-3640.patch, LUCENE-3640.patch
>
> Now that IS is never heavy (since you have to pass in your own IR), IS.close is truly a no-op... I think we should remove it.
[jira] [Updated] (SOLR-2922) Upgrade commons io and lang in Solr
[ https://issues.apache.org/jira/browse/SOLR-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2922:
---------------------------------
    Attachment: SOLR-2922.patch

> Upgrade commons io and lang in Solr
> -----------------------------------
>
>                 Key: SOLR-2922
>                 URL: https://issues.apache.org/jira/browse/SOLR-2922
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 3.5, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Trivial
>         Attachments: SOLR-2922.patch
>
> Upgrade commons-io and commons-lang in Solr.
[jira] [Updated] (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-2949:
-----------------------------------
    Assignee:     (was: Koji Sekiguchi)

> FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2949
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2949
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0.3, 4.0
>            Reporter: Grant Ingersoll
>            Priority: Minor
>              Labels: FastVectorHighlighter, Highlighter
>             Fix For: 3.5, 4.0
>         Attachments: LUCENE-2949.patch
>
> Based on my reading of the FieldTermStack constructor that loads the vector from disk, we could probably save a bunch of time and memory by using the TermVectorMapper callback mechanism instead of materializing the full array of terms into memory and then throwing most of them out.
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
---------------------------------
    Attachment: SOLR-1926.patch

This patch just removes the hl.text parameter.

> add hl.q parameter
> ------------------
>
>                 Key: SOLR-1926
>                 URL: https://issues.apache.org/jira/browse/SOLR-1926
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Trivial
>             Fix For: 3.5, 4.0
>         Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch
>
> If the hl.q parameter is set, HighlightComponent uses it rather than q.
> Use case:
> You search for PC with highlighting and faceting enabled:
> {code}
> q=PC
> &facet=on&facet.field=maker&facet.field=something
> &hl=on&hl.fl=desc
> {code}
> You get a lot of results with snippets (the term PC highlighted in the desc field). Then you click a link maker:DELL(50) to narrow the result:
> {code}
> q=PC
> &facet=on&facet.field=something
> &fq=maker:DELL
> &hl=on&hl.fl=desc
> {code}
> You'll get a narrowed result with snippets in which the term PC is highlighted. But sometimes I'd like to see DELL highlighted as well, because I clicked DELL. In this case, hl.q can be used:
> {code}
> q=PC
> &facet=on&facet.field=something
> &fq=maker:DELL
> &hl=on&hl.fl=desc*&hl.q=PC+maker:DELL*
> {code}
> Note that hl.requireFieldMatch should be false (false is the default) in this scenario.
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
    Attachment: SOLR-1926.patch

Added localParams test for hl.q parameter. I'll commit soon.

add hl.q parameter
    Key: SOLR-1926
    URL: https://issues.apache.org/jira/browse/SOLR-1926
    Project: Solr
    Issue Type: Improvement
    Components: highlighter
    Affects Versions: 1.4
    Reporter: Koji Sekiguchi
    Assignee: Koji Sekiguchi
    Priority: Trivial
    Fix For: 3.5, 4.0
    Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
    Attachment: SOLR-1926.patch

This patch supports both hl.q and hl.text. The priority is:
# Highlighter checks whether hl.text is set and, if so, uses it. FVH doesn't check it, for performance reasons.
# If hl.text is not set, hl.q is used.
# If hl.q is not set, q is used.

localParams can be used in hl.q, and the hl.text parameter accepts per-field overrides.

add hl.q parameter
    Key: SOLR-1926
    URL: https://issues.apache.org/jira/browse/SOLR-1926
    Project: Solr
    Issue Type: Improvement
    Components: highlighter
    Affects Versions: 1.4
    Reporter: Koji Sekiguchi
    Priority: Trivial
    Attachments: SOLR-1926.patch, SOLR-1926.patch
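The fallback order described in this SOLR-1926 comment (hl.text, then hl.q, then q) can be sketched as a small lookup. This is an illustration, not the patch itself: the function name `resolve_highlight_source` and the `use_fvh` flag are hypothetical, and the per-field key form `f.<field>.hl.text` is an assumption based on Solr's usual per-field parameter convention, since the comment only says that hl.text accepts a per-field override.

```python
from typing import Mapping, Optional, Tuple


def resolve_highlight_source(
    params: Mapping[str, str],
    field: Optional[str] = None,
    use_fvh: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (parameter name, value) of the input that drives highlighting."""
    if not use_fvh:
        # Assumed per-field override first, then the global hl.text.
        if field is not None:
            key = f"f.{field}.hl.text"
            if key in params:
                return key, params[key]
        if "hl.text" in params:
            return "hl.text", params["hl.text"]
    # FVH skips hl.text entirely and starts here.
    if "hl.q" in params:
        return "hl.q", params["hl.q"]
    return "q", params.get("q")


request = {"q": "PC", "hl.q": "PC maker:DELL", "hl.text": "DELL"}
print(resolve_highlight_source(request))
print(resolve_highlight_source(request, use_fvh=True))
```

A later update on the same issue removes hl.text again, in which case only the last two steps of this chain remain.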
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
    Fix Version/s: 4.0, 3.5

add hl.q parameter
    Key: SOLR-1926
    URL: https://issues.apache.org/jira/browse/SOLR-1926
    Project: Solr
    Issue Type: Improvement
    Components: highlighter
    Affects Versions: 1.4
    Reporter: Koji Sekiguchi
    Assignee: Koji Sekiguchi
    Priority: Trivial
    Fix For: 3.5, 4.0
    Attachments: SOLR-1926.patch, SOLR-1926.patch
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
    Attachment: SOLR-1926.patch

New patch. I added a test case.

add hl.q parameter
    Key: SOLR-1926
    URL: https://issues.apache.org/jira/browse/SOLR-1926
    Project: Solr
    Issue Type: Improvement
    Components: highlighter
    Affects Versions: 1.4
    Reporter: Koji Sekiguchi
    Assignee: Koji Sekiguchi
    Priority: Trivial
    Fix For: 3.5, 4.0
    Attachments: SOLR-1926.patch, SOLR-1926.patch, SOLR-1926.patch
[jira] [Updated] (SOLR-1926) add hl.q parameter
[ https://issues.apache.org/jira/browse/SOLR-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1926:
    Attachment: SOLR-1926.patch

The first draft patch. It implements only hl.q. I'm still working on hl.text, which was suggested by Hoss.

add hl.q parameter
    Key: SOLR-1926
    URL: https://issues.apache.org/jira/browse/SOLR-1926
    Project: Solr
    Issue Type: Improvement
    Components: highlighter
    Affects Versions: 1.4
    Reporter: Koji Sekiguchi
    Priority: Trivial
    Attachments: SOLR-1926.patch
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3440:
    Attachment: LUCENE-3440.patch

New patch; it still has test failures, though.

FastVectorHighlighter: IDF-weighted terms for ordered fragments
    Key: LUCENE-3440
    URL: https://issues.apache.org/jira/browse/LUCENE-3440
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 3.5, 4.0
    Reporter: sebastian L.
    Priority: Minor
    Labels: FastVectorHighlighter
    Fix For: 3.5, 4.0
    Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, weight-vs-boost_table02.html

The FastVectorHighlighter gives every term found in a fragment an equal weight, so fragments with a high number of words (in the worst case, a high number of very common words) rank higher than fragments that contain *all* of the terms used in the original query.

This patch provides ordered fragments with IDF-weighted terms:

total weight = total weight + IDF for unique term per fragment * boost of query;

The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer.

The patch is simple, but it works for us.

Some ideas:
- A better approach would be moving the whole fragment scoring into a separate class.
- Switch scoring via a parameter.
- Exact phrases should be given an even better score, regardless of whether a phrase query was executed.
- The edismax/dismax parameters pf, ps and pf^boost should be observed, and the corresponding fragments should be ranked higher.
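The weighting rule quoted in LUCENE-3440 ("total weight = total weight + IDF for unique term per fragment * boost of query") can be made concrete with a short sketch. This is an illustration under stated assumptions, not the attached patch: the function names are hypothetical, the IDF formula `1 + ln(numDocs / (docFreq + 1))` is assumed here because it is the classic Lucene default, and a single boost value is applied to all terms for simplicity.

```python
import math
from typing import List, Mapping, Sequence


def idf(doc_freq: int, num_docs: int) -> float:
    # Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def fragment_weight(
    terms: Sequence[str],
    doc_freqs: Mapping[str, int],
    num_docs: int,
    boost: float = 1.0,
) -> float:
    """Sum IDF * boost once per unique term found in the fragment."""
    total = 0.0
    for term in set(terms):
        total += idf(doc_freqs[term], num_docs) * boost
    return total


def rank_fragments(
    fragments: Sequence[Sequence[str]],
    doc_freqs: Mapping[str, int],
    num_docs: int,
    boost: float = 1.0,
) -> List[Sequence[str]]:
    """Order fragments by descending total weight."""
    return sorted(
        fragments,
        key=lambda frag: fragment_weight(frag, doc_freqs, num_docs, boost),
        reverse=True,
    )


doc_freqs = {"the": 999, "lucene": 9}
common_only = ["the", "the", "the", "the"]  # many hits, one very common term
all_terms = ["the", "lucene"]  # fewer hits, but every query term
print(rank_fragments([common_only, all_terms], doc_freqs, num_docs=1000))
```

With equal per-hit weights, the first fragment would win on its four hits; counting each unique term once and weighting by IDF puts the fragment that contains the rare term first.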
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3440:
    Attachment: (was: LUCENE-3440.patch)

FastVectorHighlighter: IDF-weighted terms for ordered fragments
    Key: LUCENE-3440
    URL: https://issues.apache.org/jira/browse/LUCENE-3440
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 3.5, 4.0
    Reporter: sebastian L.
    Priority: Minor
    Labels: FastVectorHighlighter
    Fix For: 3.5, 4.0
    Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, weight-vs-boost_table02.html
[jira] [Updated] (SOLR-2837) listing supported languages
[ https://issues.apache.org/jira/browse/SOLR-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2837:
    Attachment: SOLR-2837.patch

Draft patch; this is not the ideal version. It is uncool, but it works for me today.

{code}
$ cd contrib/langid
$ ant -emacs list-supported-lang
list-supported-lang:
da is it no hu th de el fi pt pl sv fr en ru et es nl
BUILD SUCCESSFUL
Total time: 0 seconds
{code}

listing supported languages
    Key: SOLR-2837
    URL: https://issues.apache.org/jira/browse/SOLR-2837
    Project: Solr
    Issue Type: Improvement
    Components: contrib - LangId
    Affects Versions: 3.5, 4.0
    Reporter: Koji Sekiguchi
    Priority: Minor
    Attachments: SOLR-2837.patch

As a user of langid, I'd like to know which languages are supported by the current langid, ideally via the admin GUI.
[jira] [Updated] (SOLR-2838) use preferable default for langid.idField
[ https://issues.apache.org/jira/browse/SOLR-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2838:
    Attachment: SOLR-2838.patch

use preferable default for langid.idField
    Key: SOLR-2838
    URL: https://issues.apache.org/jira/browse/SOLR-2838
    Project: Solr
    Issue Type: Improvement
    Components: contrib - LangId
    Affects Versions: 3.5, 4.0
    Reporter: Koji Sekiguchi
    Priority: Trivial
    Attachments: SOLR-2838.patch

langid.idField is used for logging purposes in langid. If it is not set, id is used as the default. But if the schema has no id field, users get undesirable warnings in the log, and because the parameter is likely hidden from them, the cause is hard to discern:

{noformat}
WARNING: Document *null* does not contain input field subject. Skipping this field.
{noformat}

As we can access IndexSchema in initParams(), why don't we use the uniqueKey field as the default?
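The default proposed in SOLR-2838 amounts to a short fallback chain. The sketch below is hypothetical: `resolve_id_field` is not a langid function, and the schema's uniqueKey is passed in as a plain string rather than read from IndexSchema.

```python
from typing import Mapping, Optional


def resolve_id_field(params: Mapping[str, str], schema_unique_key: Optional[str]) -> str:
    """Pick the field used to identify documents in langid log messages."""
    configured = params.get("langid.idField")
    if configured:
        return configured  # an explicit setting always wins
    if schema_unique_key:
        return schema_unique_key  # proposed default: the schema's uniqueKey
    return "id"  # previous default, kept as the last resort


print(resolve_id_field({}, "docno"))
print(resolve_id_field({"langid.idField": "url"}, "docno"))
```

With this order, a schema whose uniqueKey is not named id would log the real key value instead of *null*.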
[jira] [Updated] (LUCENE-3513) Add SimpleFragListBuilder constructor with margin parameter
[ https://issues.apache.org/jira/browse/LUCENE-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3513:
    Affects Version/s: (was: 3.4) 2.9
    Fix Version/s: 4.0, 3.5
    Assignee: Koji Sekiguchi

Add SimpleFragListBuilder constructor with margin parameter
    Key: LUCENE-3513
    URL: https://issues.apache.org/jira/browse/LUCENE-3513
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 2.9
    Reporter: Kelsey Francis
    Assignee: Koji Sekiguchi
    Priority: Minor
    Fix For: 3.5, 4.0
    Attachments: LUCENE-3513.patch, LUCENE-3513.patch

{{SimpleFragListBuilder}} would benefit from an additional constructor that takes in {{margin}}. Currently, the margin is defined as a constant, so to implement a {{FragListBuilder}} with a different margin, one has no choice but to copy and paste {{SimpleFragListBuilder}} into a new class that must be placed in the {{org.apache.lucene.search.vectorhighlight}} package due to accesses of package-protected fields in other classes.

If this change were made, the precondition check of the constructor's {{fragCharSize}} should probably be altered to reject values less than {{max(1, margin*3)}}, to allow for a margin of 0.
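The constructor and precondition proposed in LUCENE-3513 can be sketched as follows. This is an illustration in Python, not the Java patch: the class and method names are hypothetical, the default margin of 6 is an assumption, and the check is read as "reject a fragCharSize below max(1, margin*3)".

```python
class FragListBuilderSketch:
    """Toy model of a fragment-list builder with a configurable margin."""

    DEFAULT_MARGIN = 6  # assumed value of the existing constant

    def __init__(self, margin: int = DEFAULT_MARGIN) -> None:
        if margin < 0:
            raise ValueError(f"margin({margin}) must not be negative")
        self.margin = margin
        # max(1, ...) keeps the minimum positive when margin is 0.
        self.min_frag_char_size = max(1, margin * 3)

    def check_frag_char_size(self, frag_char_size: int) -> int:
        """Return frag_char_size if it satisfies the precondition."""
        if frag_char_size < self.min_frag_char_size:
            raise ValueError(
                f"fragCharSize({frag_char_size}) is too small. "
                f"It must be {self.min_frag_char_size} or higher."
            )
        return frag_char_size


default_builder = FragListBuilderSketch()
zero_margin_builder = FragListBuilderSketch(margin=0)
print(default_builder.min_frag_char_size, zero_margin_builder.min_frag_char_size)
```

Without the `max(1, ...)` guard, a margin of 0 would give a minimum of 0 and let an empty fragment size through the check.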
[jira] [Updated] (LUCENE-3513) Add SimpleFragListBuilder constructor with margin parameter
[ https://issues.apache.org/jira/browse/LUCENE-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3513:
    Attachment: LUCENE-3513.patch

Updated patch attached. I needed to update the test. I've also changed the visibility of the new member variable to package default for the test. I'll commit soon.

Add SimpleFragListBuilder constructor with margin parameter
    Key: LUCENE-3513
    URL: https://issues.apache.org/jira/browse/LUCENE-3513
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 2.9
    Reporter: Kelsey Francis
    Priority: Minor
    Fix For: 3.5, 4.0
    Attachments: LUCENE-3513.patch, LUCENE-3513.patch
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3440:
    Attachment: LUCENE-3440.patch

In this patch, I removed the FieldFragList interface, renamed BaseFieldFragList to FieldFragList, and moved the javadocs from the interface to the abstract class. I'm still working on it.

FastVectorHighlighter: IDF-weighted terms for ordered fragments
    Key: LUCENE-3440
    URL: https://issues.apache.org/jira/browse/LUCENE-3440
    Project: Lucene - Java
    Issue Type: Improvement
    Components: modules/highlighter
    Affects Versions: 3.5, 4.0
    Reporter: sebastian L.
    Priority: Minor
    Labels: FastVectorHighlighter
    Fix For: 3.5, 4.0
    Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, weight-vs-boost_table02.html
[jira] [Updated] (SOLR-1895) ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time
[ https://issues.apache.org/jira/browse/SOLR-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-1895:
    Attachment: SOLR-1895-queries.patch

I fixed the patch for distributed search. I also modified the dot.classpath file in this patch.

ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time
    Key: SOLR-1895
    URL: https://issues.apache.org/jira/browse/SOLR-1895
    Project: Solr
    Issue Type: New Feature
    Components: SearchComponents - other
    Reporter: Karl Wright
    Labels: document, security, solr
    Fix For: 3.5, 4.0
    Attachments: LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, SOLR-1895-queries.patch, SOLR-1895-queries.patch, SOLR-1895-queries.patch, SOLR-1895-queries.patch, SOLR-1895-queries.patch, SOLR-1895-service-plugin.patch, SOLR-1895-service-plugin.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch

I've written an LCF SearchComponent which filters returned results based on access tokens provided by LCF's authority service. The component requires you to configure the appropriate authority service URL base, e.g.:

    <!-- LCF document security enforcement component -->
    <searchComponent name="lcfSecurity" class="LCFSecurityFilter">
      <str name="AuthorityServiceBaseURL">http://localhost:8080/lcf-authority-service</str>
    </searchComponent>

Also required are the following schema.xml additions:

    <!-- Security fields -->
    <field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>

Finally, to tie it into the standard request handler, it seems to need to run last:

    <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <arr name="last-components">
        <str>lcfSecurity</str>
      </arr>
      ...

I have not set a package for this code. Nor have I been able to get it reviewed by someone as conversant with Solr as I would prefer. It is my hope, however, that this module will become part of the standard Solr 1.5 suite of search components, since that would tie it in with LCF nicely.