[
https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Parvulescu updated JCR-3075:
---------------------------------
Attachment: JCR-3075.patch
This problem also affects excerpt generation for any quoted phrase search, not
just Japanese.
Normally a quoted phrase should be considered as only one item when building
the excerpt.
Because of LUCENE-2458, a normal search in Japanese now turns into the
equivalent of a quoted search in, let's say, English.
Since the excerpt generator has trouble dealing with phrases, any Japanese
search gets each character of the search token highlighted separately,
instead of a single highlight covering the whole word.
The patch should fix both the original issue and highlighting for any quoted
search.
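To illustrate the idea of treating a phrase as one item, here is a minimal
sketch. It is not the code from the patch and does not use the Jackrabbit or
Lucene APIs; the class HighlightMergeSketch and the method highlight are
hypothetical names. It assumes the matched terms are given as [start, end)
character offsets, sorted and non-overlapping, and it only merges ranges that
touch directly (the per-character Japanese case). An English phrase, where
terms are separated by spaces, would also need the phrase position info.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Jackrabbit or of JCR-3075.patch.
class HighlightMergeSketch {

    /**
     * Wraps matched ranges in <strong> tags. Ranges are [start, end) offsets,
     * assumed sorted by start and non-overlapping. Ranges that touch
     * (end of one == start of the next) are merged into a single highlight.
     */
    public static String highlight(String text, int[][] offsets) {
        List<int[]> merged = new ArrayList<>();
        for (int[] o : offsets) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last)[1] == o[0]) {
                // Directly adjacent to the previous range: extend it.
                merged.get(last)[1] = o[1];
            } else {
                merged.add(new int[] {o[0], o[1]});
            }
        }

        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] m : merged) {
            sb.append(text, pos, m[0]);          // text before the highlight
            sb.append("<strong>");
            sb.append(text, m[0], m[1]);         // the highlighted span
            sb.append("</strong>");
            pos = m[1];
        }
        sb.append(text, pos, text.length());     // remaining text
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three single-character matches, as produced for a Japanese token.
        String text = "and \u30c6\u30b9\u30c8 ('test').";
        int[][] perCharacter = {{4, 5}, {5, 6}, {6, 7}};
        System.out.println(highlight(text, perCharacter));
    }
}
```

With the three per-character offsets above, the whole word ends up inside one
pair of strong tags instead of three.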
The problem is that one test is failing and I'm not sure why :(
The failing test is ExcerptTest#testEncodeIllegalCharsNoHighlights, which
apparently fails because the node returned from the search carries more info
than expected.
This should not happen, as I haven't touched that part of the code (node
indexing), but sadly it does, so I still need to investigate.
I'd also welcome some feedback on the approach.
> incorrect HTML excerpt generation for queries on japanese text content
> -----------------------------------------------------------------------
>
> Key: JCR-3075
> URL: https://issues.apache.org/jira/browse/JCR-3075
> Project: Jackrabbit Content Repository
> Issue Type: Bug
> Components: jackrabbit-core
> Reporter: Julian Reschke
> Priority: Minor
> Attachments: JCR-3075.patch
>
>
> The generated excerpt highlights single characters instead of full words.
> Test case (to be added to FullTextQueryTest):
> public void testJapaneseAndHighlight() throws RepositoryException {
> // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
> String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
> // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
> String jTest = "\u30c6\u30b9\u30c8";
>
> String content = "some text with japanese: " + jContent
> + " ('content')" + " and " + jTest + " ('test').";
> // expected excerpt; note this may change if excerpt providers change
> String expectedExcerpt = "<div><span>some text with japanese: " + jContent
> + " ('content') and <strong>" + jTest
> + "</strong> ('test').</span></div>";
>
> Node n = testRootNode.addNode("node1");
> n.setProperty("title", content);
> testRootNode.getSession().save();
>
> String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
> + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
> Query q = superuser.getWorkspace().getQueryManager()
> .createQuery(xpath, Query.XPATH);
>
> QueryResult qr = q.execute();
> RowIterator it = qr.getRows();
> int cnt = 0;
> while (it.hasNext()) {
> cnt++;
> Row found = it.nextRow();
> assertEquals(n.getPath(), found.getPath());
> String excerpt = found.getValue("rep:excerpt(.)").getString();
> assertEquals(expectedExcerpt, excerpt);
> }
>
> assertEquals(1, cnt);
> }