[jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content

Alex Parvulescu (JIRA) Wed, 21 Sep 2011 05:13:33 -0700

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alex Parvulescu updated JCR-3075:
---------------------------------

    Attachment: JCR-3075.patch

This problem also affects excerpt generation for any quoted phrase search, not 
just Japanese.
Normally a quoted phrase should be treated as a single item when building 
the excerpt.

Because of LUCENE-2458, a normal search in Japanese now turns into the 
equivalent of a quoted phrase search in, say, English.
Since the excerpt generator has issues dealing with phrases, any Japanese 
search ends up with each character of the search token highlighted 
individually, instead of a single highlight containing the whole word.
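
To make the difference concrete, here is a rough sketch of the two behaviours. 
This is not the code from the patch and it does not use the Jackrabbit or 
Lucene APIs; the class PhraseHighlightSketch and both method names are made up 
for illustration. It also assumes the phrase terms sit next to each other with 
no separator (as with per-character tokens), and it skips HTML escaping.

```java
import java.util.List;

// Hypothetical illustration only; not part of Jackrabbit or of JCR-3075.patch.
public class PhraseHighlightSketch {

    // Highlights every occurrence of every term on its own.
    // This mirrors the behaviour described as the bug.
    public static String highlightPerTerm(String text, List<String> terms) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String hit = null;
            for (String term : terms) {
                if (!term.isEmpty() && text.startsWith(term, i)) {
                    hit = term;
                    break;
                }
            }
            if (hit != null) {
                out.append("<strong>").append(hit).append("</strong>");
                i += hit.length();
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Treats the phrase as one item: a highlight is emitted only where all
    // terms occur consecutively, and it wraps the whole phrase once.
    // Assumption: terms are adjacent in the text with no separator.
    public static String highlightPhrase(String text, List<String> terms) {
        String phrase = String.join("", terms);
        if (phrase.isEmpty()) {
            return text;
        }
        return text.replace(phrase, "<strong>" + phrase + "</strong>");
    }
}
```

Calling both methods with the three characters of the search word as terms 
shows the effect: highlightPerTerm wraps each character separately (and also 
any stray occurrence of a single character elsewhere in the text), while 
highlightPhrase produces one highlight around the whole word.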

The patch should fix both the original issue and highlighting for any quoted 
search.
The problem is that one test is failing and I'm not sure why :(

The failing test is ExcerptTest#testEncodeIllegalCharsNoHighlights, which 
apparently fails because the node returned from the search carries more info 
than expected.
This should not happen, as I haven't touched that part of the code (node 
indexing), but sadly it does, so I still need to investigate.

I'd also welcome some feedback on the approach.

> incorrect HTML excerpt generation for queries on japanese text content 
> -----------------------------------------------------------------------
>
>                 Key: JCR-3075
>                 URL: https://issues.apache.org/jira/browse/JCR-3075
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-core
>            Reporter: Julian Reschke
>            Priority: Minor
>         Attachments: JCR-3075.patch
>
>
> The generated excerpt highlights single characters instead of full words. 
> Test case (to be added to FullTextQueryTest):
>     public void testJapaneseAndHighlight() throws RepositoryException {
>         // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
>         String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
>         // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
>         String jTest = "\u30c6\u30b9\u30c8";
>         
>         String content = "some text with japanese: " + jContent
>                 + " ('content')" + " and " + jTest + " ('test').";
>         // expected excerpt; note this may change if excerpt providers change
>         String expectedExcerpt = "<div><span>some text with japanese: "
>                 + jContent
>                 + " ('content') and <strong>" + jTest
>                 + "</strong> ('test').</span></div>";
>         
>         Node n = testRootNode.addNode("node1");
>         n.setProperty("title", content);
>         testRootNode.getSession().save();
>         
>         String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
>                 + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
>         Query q = superuser.getWorkspace().getQueryManager()
>                 .createQuery(xpath, Query.XPATH);
>         
>         QueryResult qr = q.execute();
>         RowIterator it = qr.getRows();
>         int cnt = 0;
>         while (it.hasNext()) {
>             cnt++;
>             Row found = it.nextRow();
>             assertEquals(n.getPath(), found.getPath());
>             String excerpt = found.getValue("rep:excerpt(.)").getString();
>             assertEquals(expectedExcerpt, excerpt);
>         }
>         
>         assertEquals(1, cnt);
>     }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
