cleaner

Vincent Massol Fri, 27 Feb 2009 04:19:18 -0800

Hi Asiri,

On Feb 27, 2009, at 12:32 PM, asiri (SVN) wrote:


> Author: asiri
> Date: 2009-02-27 12:32:21 +0100 (Fri, 27 Feb 2009)
> New Revision: 17078
>
> Added:
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/AbstractHTMLCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/ 
> EmptyLineParagraphOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/ImageOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/InvalidTagOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/LineBreakOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/LinkOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/ListOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/MiscWysiwygCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/ 
> RedundantTagOpenOfficeCleaningTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/TableOpenOfficeCleaningTest.java
> Removed:
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/AbstractHTMLCleanerTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/OpenOfficeHTMLCleanerTest.java
>   platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/ 
> officeimporter/internal/cleaner/WysiwygHTMLCleanerTest.java
> Modified:
>   platform/core/trunk/xwiki-officeimporter/src/main/java/org/xwiki/ 
> officeimporter/filter/LineBreakFilter.java
> Log:
> XWIKI-3265: Restructure officeimporter test cases + write more tests
>
> * Completed.

[snip]

> +public class InvalidTagOpenOfficeCleaningTest extends  
> AbstractHTMLCleaningTest
> +{
> +    /**
> +     * {@code <style>} tags should be stripped from html content.
> +     */
> +    public void testStyleTagRemoving()
> +    {
> +        String html =
> +            "<html><head><title>Title</title>" + "<style type= 
> \"text/css\">h1 {color:red} p {color:blue} </style>"
> +                + "</head><body>" + footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("style");
> +        assertEquals(0, nodes.getLength());
> +    }
> +
> +    /**
> +     * {@code <style>} tags should be stripped from html content.

Copy-paste error: this Javadoc should say <script>, not <style>.

> +     */
> +    public void testScriptTagRemoving()
> +    {
> +        String html = header + "<script type=\"text/javascript 
> \">document.write(\"Hello World!\")</script>" + footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("script");
> +        assertEquals(0, nodes.getLength());
> +    }
> +}
>

[snip]

> +    /**
> +     * {@code <br/>} elements placed next to paragraph elements  
> should be converted to {@code<div
> +     * class="wikikmodel-emptyline"/>} elements.
> +     */
> +    public void testLineBreaksNextToParagraphElements()
> +    {
> +        checkLineBreakReplacements("<br/><br/><p>para</p>", 0, 2);
> +        checkLineBreakReplacements("<p>para</p><br/><br/>", 0, 2);
> +        checkLineBreakReplacements("<p>para</p><br/><br/><p>para</ 
> p>", 0, 2);
> +    }

Shouldn't this be done by the default HTML Cleaner?
Same for the other tests in this category.

> +    /**
> +     * The html generated by open office server includes anchors of  
> the form {@code<a name="table1"><h1>Sheet 2:
> +     * <em>Hello</em></h1></a>} and the default html cleaner  
> converts them to {@code <a name="table1"/><h1><a
> +     * name="table1">Sheet 1: <em>Hello</em></a></h1>} this is  
> because of the close-before-copy-inside
> +     * behaviour of default html cleaner. Thus the additional (copy- 
> inside) anchor needs to be ripped off.

This looks like a bug in the default HTML cleaner, no?

> +    /**
> +     * If there are leading spaces within the content of a list  
> item ({@code<li/>}) they should be trimmed.
> +     */
> +    public void testListItemContentLeadingSpaceTrimming()
> +    {
> +        String html = header + "<ol><li> Test</li></ol>" + footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("li");
> +        Node listContent = nodes.item(0).getFirstChild();
> +        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
> +        assertEquals("Test", listContent.getNodeValue());
> +    }

Shouldn't this be done in the default HTML cleaner? Actually I think  
this is already done in the XHTML parser by the whitespace XML filter.  
If not, then it's a bug in the whitespace filter.

For all bugs, please reference the JIRA issue in the javadoc and explain  
that the code will be removed once the bug is fixed.
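
To illustrate the kind of javadoc I mean, here is a rough sketch only. It is not the actual cleaner code: it works on a plain W3C DOM using JDK classes, the class and method names are made up for the example, and XWIKI-NNNN is a placeholder, not a real issue key.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical example class, not part of the officeimporter module. */
final class ListItemTrimSketch
{
    /**
     * Strips leading whitespace from the first text node of every {@code <li>}.
     * <p>
     * Workaround for XWIKI-NNNN (placeholder key): the whitespace filter is expected to do
     * this. Remove this method once that issue is fixed.
     */
    static void trimLeadingSpaces(Document doc)
    {
        NodeList items = doc.getElementsByTagName("li");
        for (int i = 0; i < items.getLength(); i++) {
            Node first = items.item(i).getFirstChild();
            if (first != null && first.getNodeType() == Node.TEXT_NODE) {
                first.setNodeValue(first.getNodeValue().replaceFirst("^\\s+", ""));
            }
        }
    }

    /** Parses well-formed XHTML, applies the workaround and returns the first list item's text. */
    static String firstListItemText(String xhtml)
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            trimLeadingSpaces(doc);
            return doc.getElementsByTagName("li").item(0).getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

The point is the javadoc: it names the issue and says when the code can go away, so nobody has to guess later why the workaround exists.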

> +
> +    /**
> +     * If there is a leading paragraph inside a list item, it  
> should be replaced with it's content.
> +     */
> +    public void testListItemContentIsolatedParagraphCleaning()
> +    {
> +        String html = header + "<ol><li><p>Test</p></li></ol>" +  
> footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("li");
> +        Node listContent = nodes.item(0).getFirstChild();
> +        assertEquals(Node.TEXT_NODE, listContent.getNodeType());
> +        assertEquals("Test", listContent.getNodeValue());
> +    }
> +}

This should be handled by a combination of both XHTML parser and Wiki  
Syntax Renderer and/or by the default HTML cleaner.

> +    /**
> +     * Test cleaning of html paragraphs brearing namespaces.
> +     */
> +    public void testParagraphsWithNamespaces()
> +    {
> +        String html = header + "<w:p>paragraph</w:p>" + footer;
> +        Document doc =
> +            wysiwygHTMLCleaner.clean(new StringReader(html),  
> Collections.singletonMap(HTMLCleaner.NAMESPACES_AWARE,
> +                "false"));
> +        NodeList nodes = doc.getElementsByTagName("p");
> +        assertEquals(1, nodes.getLength());
> +    }

Hmm... I think this needs to be reviewed, and we need to check whether the  
wikimodel XHTML parser supports namespaces.

> +
> +    /**
> +     * The source of the images in copy pasted html content should  
> be replaces with 'Missing.png' since they can't be
> +     * uploaded automatically.
> +     */
> +    public void testImageFiltering()
> +    {
> +        String html = header + "<img src=\"file://path/to/local/image.png 
> \"/>" + footer;
> +        Document doc = wysiwygHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("img");
> +        assertEquals(1, nodes.getLength());
> +        Element image = (Element) nodes.item(0);
> +        Node startComment = image.getPreviousSibling();
> +        Node stopComment = image.getNextSibling();
> +        assertEquals(Node.COMMENT_NODE, startComment.getNodeType());
> +         
> assertTrue 
> (startComment.getNodeValue().equals("startimage:Missing.png"));

It should be lowercase "missing.png". So does this mean a missing.png  
image needs to be present in all skins?

Has this been discussed and is everyone aware of this?

> +    /**
> +     * Test filtering of those tags which doesn't have any  
> attributes set.
> +     */
> +    public void testFilterIfZeroAttributes()
> +    {
> +        String htmlTemplate = header + "<p>Test%sRedundant 
> %sFiltering</p>" + footer;
> +        String[] filterIfZeroAttributesTags = new String[] {"span",  
> "div"};
> +        for (String tag : filterIfZeroAttributesTags) {
> +            String startTag = "<" + tag + ">";
> +            String endTag = "</" + tag + ">";
> +            String html = String.format(htmlTemplate, startTag,  
> endTag);
> +            Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +            NodeList nodes = doc.getElementsByTagName(tag);
> +            assertEquals(0, nodes.getLength());
> +        }
> +    }

Shouldn't this be done in the default HTML cleaner?

> +
> +    /**
> +     * Test filtering of those tags which doesn't have any textual  
> content in them.
> +     */
> +    public void testFilterIfNoContent()
> +    {
> +        String htmlTemplate = header + "<p>Test%sRedundant%s%s 
> %sFiltering</p>" + footer;
> +        String[] filterIfNoContentTags =
> +            new String[] {"em", "strong", "dfn", "code", "samp",  
> "kbd", "var", "cite", "abbr", "acronym", "address",
> +            "blockquote", "q", "pre", "h1", "h2", "h3", "h4", "h5",  
> "h6"};
> +        for (String tag : filterIfNoContentTags) {
> +            String startTag = "<" + tag + ">";
> +            String endTag = "</" + tag + ">";
> +            String html = String.format(htmlTemplate, startTag,  
> endTag, startTag, endTag);
> +            Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +            NodeList nodes = doc.getElementsByTagName(tag);
> +            assertEquals(1, nodes.getLength());
> +        }
> +    }
> +}

Shouldn't this be done in the default HTML cleaner?

> +    /**
> +     * An isolated paragraph inside a table cell item should be  
> replaced with paragraph's content.
> +     */
> +    public void testTableCellItemIsolatedParagraphCleaning()
> +    {
> +        String html = header + "<table><tr><td><p>Test</p></td></ 
> tr></table>" + footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("td");
> +        Node cellContent = nodes.item(0).getFirstChild();
> +        assertEquals(Node.TEXT_NODE, cellContent.getNodeType());
> +        assertEquals("Test", cellContent.getNodeValue());
> +    }

Isn't this already tested above?
In any case shouldn't this be moved out of the importer?
Same for other tests in the same category.
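
To show why I think the list item and table cell cases are really the same rule, here is a minimal sketch of one shared helper. It assumes a plain W3C DOM and JDK classes only. The class and method names are invented for the example and this is not how the cleaner is implemented today.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical example class, not part of the officeimporter module. */
final class IsolatedParagraphSketch
{
    /** Replaces a lone {@code <p>} child of each named container with the paragraph's children. */
    static void unwrap(Document doc, String... containerTags)
    {
        for (String tag : containerTags) {
            NodeList containers = doc.getElementsByTagName(tag);
            for (int i = 0; i < containers.getLength(); i++) {
                Node container = containers.item(i);
                Node only = container.getFirstChild();
                boolean isolated = only != null && only.getNextSibling() == null
                    && "p".equals(only.getNodeName());
                if (isolated) {
                    // Move the paragraph's children up, then drop the empty paragraph.
                    while (only.getFirstChild() != null) {
                        container.insertBefore(only.getFirstChild(), only);
                    }
                    container.removeChild(only);
                }
            }
        }
    }

    /** Parses XHTML, unwraps for li and td, returns the node value of the first child of {@code tag}. */
    static String firstChildValue(String xhtml, String tag)
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            unwrap(doc, "li", "td");
            return doc.getElementsByTagName(tag).item(0).getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

With something like this, a single test class could cover both containers, and cells or items holding several paragraphs are left untouched.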

> +    /**
> +     * If multiple paragraphs are found inside a table cell item,  
> they should be wrapped in an embedded document.
> +     */
> +    public void testTableCellItemMultipleParagraphWrapping()
> +    {
> +        assertEquals(true,  
> checkEmbeddedDocumentGeneration("<table><tr><td><p>Test</p><p>Test</ 
> p></td></tr></table>",
> +            "td"));
> +    }

This looks like a bug in the XHTML parser.
Same for other tests in the same category.

> +
> +    /**
> +     * Empty rows should be removed.
> +     */
> +    public void testEmptyRowRemoving()
> +    {
> +        String html = header + "<table><tbody><tr><td>cell</td></ 
> tr><tr></tr></tbody></table>" + footer;
> +        Document doc = openOfficeHTMLCleaner.clean(new  
> StringReader(html));
> +        NodeList nodes = doc.getElementsByTagName("tr");
> +        assertEquals(1, nodes.getLength());
> +    }

Shouldn't this be done in the default HTML cleaner?
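
Nothing about this rule is specific to OpenOffice output. As a rough sketch (plain W3C DOM, JDK classes only, invented names, and "empty" taken to mean "no child elements", which is my assumption), a generic filter could be as small as:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical example class, not part of the officeimporter module. */
final class EmptyRowSketch
{
    /** Removes every {@code <tr>} that has no child elements. */
    static void removeEmptyRows(Document doc)
    {
        NodeList rows = doc.getElementsByTagName("tr");
        // The list is live, so walk it backwards to keep unvisited indices stable.
        for (int i = rows.getLength() - 1; i >= 0; i--) {
            Node row = rows.item(i);
            if (!hasChildElement(row)) {
                row.getParentNode().removeChild(row);
            }
        }
    }

    private static boolean hasChildElement(Node node)
    {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    /** Parses XHTML, removes empty rows and returns how many rows remain. */
    static int countRowsAfterCleaning(String xhtml)
    {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            removeEmptyRows(doc);
            return doc.getElementsByTagName("tr").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```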

Thanks
-Vincent
http://xwiki.com
http://xwiki.org
http://massol.net


Re: [xwiki-devs] [xwiki-notifications] r17078 - in platform/core/trunk/xwiki-officeimporter/src: main/java/org/xwiki/officeimporter/filter test/java/org/xwiki/officeimporter/internal/cleaner