Hi Asiri,

Are you sure this should go in the office importer and not in the HTML  
cleaner or in the XHTML parser?

Thanks
-Vincent

On Nov 24, 2008, at 6:08 PM, asiri (SVN) wrote:

> Author: asiri
> Date: 2008-11-24 18:08:51 +0100 (Mon, 24 Nov 2008)
> New Revision: 14425
>
> Modified:
>   sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/ 
> plugin/officeimporter/filter/RedundantTagFilter.java
> Log:
> XAOFFICE-1 : Develop the initial feature set for office-importer  
> plugin.
>
> * Added support for filtering empty / redundant paragraphs.
>
> Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/ 
> xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
> ===================================================================
> --- sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/ 
> plugin/officeimporter/filter/RedundantTagFilter.java  2008-11-24  
> 15:17:17 UTC (rev 14424)
> +++ sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/ 
> plugin/officeimporter/filter/RedundantTagFilter.java  2008-11-24  
> 17:08:51 UTC (rev 14425)
> @@ -31,12 +31,13 @@
>
>     public void filter(Document document, ImporterContext context)
>     {
> -        for(String key : attributeWiseFilteredTags) {
> +        for (String key : attributeWiseFilteredTags) {
>              
> filterNodesWithZeroAttributes(document.getElementsByTagName(key));
>         }
> -        for(String key : contentWiseFilteredTags) {
> +        for (String key : contentWiseFilteredTags) {
>              
> filterNodesWithEmptyTextContent(document.getElementsByTagName(key));
> -        }
> +        }
> +        filterEmptyParagraphs(document);
>     }
>
>     /**
> @@ -70,10 +71,35 @@
>     {
>         for (int i = 0; i < elements.getLength(); i++) {
>             Element element = (Element) elements.item(i);
> -            if (element.getTextContent().trim().equals("")) {
> +            if (element.getTextContent().trim().equals("")) {
>                 element.getParentNode().removeChild(element);
>                 i--;
>             }
>         }
>     }
> +
> +    /**
> +     * OpenOffice server generates redundant paragraphs (with empty  
> content) to achieve spacing.
> +     * These paragraphs should be stripped off / replaced with  
> {@code <br/>} elements appropriately
> +     * because otherwise they result in spurious {@code (%%)}  
> elements in generated xwiki content.
> +     *
> +     * @param document The html document.
> +     */
> +    private void filterEmptyParagraphs(Document document)
> +    {
> +        NodeList paragraphs = document.getElementsByTagName("p");
> +        for (int i = 0; i < paragraphs.getLength(); i++) {
> +            Element paragraph = (Element) paragraphs.item(i);
> +            if (paragraph.getTextContent().trim().equals("")) {
> +                // We suspect this is an empty paragraph but it is  
> possible that it contains other
> +                // non-textual tags like images. For the moment  
> we'll only search for internal image
> +                // tags, we might have to refine this criterion  
> later.
> +                NodeList internalImages =  
> paragraph.getElementsByTagName("img");
> +                if (internalImages.getLength() == 0) {
> +                    paragraph.getParentNode().removeChild(paragraph);
> +                    i--;
> +                }
> +            }
> +        }
> +    }
> }
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
