Hi Asiri,
Are you sure this should go in the office importer and not in the HTML
cleaner or in the XHTML parser?
Thanks
-Vincent
On Nov 24, 2008, at 6:08 PM, asiri (SVN) wrote:
> Author: asiri
> Date: 2008-11-24 18:08:51 +0100 (Mon, 24 Nov 2008)
> New Revision: 14425
>
> Modified:
> sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
> Log:
> XAOFFICE-1 : Develop the initial feature set for office-importer plugin.
>
> * Added support for filtering empty / redundant paragraphs.
>
> Modified: sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java
> ===================================================================
> --- sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 15:17:17 UTC (rev 14424)
> +++ sandbox/xwiki-plugin-officeimporter/src/main/java/com/xpn/xwiki/plugin/officeimporter/filter/RedundantTagFilter.java 2008-11-24 17:08:51 UTC (rev 14425)
> @@ -31,12 +31,13 @@
>
> public void filter(Document document, ImporterContext context)
> {
> - for(String key : attributeWiseFilteredTags) {
> + for (String key : attributeWiseFilteredTags) {
>
> filterNodesWithZeroAttributes(document.getElementsByTagName(key));
> }
> - for(String key : contentWiseFilteredTags) {
> + for (String key : contentWiseFilteredTags) {
>
> filterNodesWithEmptyTextContent(document.getElementsByTagName(key));
> - }
> + }
> + filterEmptyParagraphs(document);
> }
>
> /**
> @@ -70,10 +71,35 @@
> {
> for (int i = 0; i < elements.getLength(); i++) {
> Element element = (Element) elements.item(i);
> - if (element.getTextContent().trim().equals("")) {
> + if (element.getTextContent().trim().equals("")) {
> element.getParentNode().removeChild(element);
> i--;
> }
> }
> }
> +
> + /**
> + * OpenOffice server generates redundant paragraphs (with empty content) to achieve spacing.
> + * These paragraphs should be stripped off / replaced with {@code <br/>} elements appropriately
> + * because otherwise they result in spurious {@code (%%)} elements in generated xwiki content.
> + *
> + * @param document The html document.
> + */
> + private void filterEmptyParagraphs(Document document)
> + {
> + NodeList paragraphs = document.getElementsByTagName("p");
> + for (int i = 0; i < paragraphs.getLength(); i++) {
> + Element paragraph = (Element) paragraphs.item(i);
> + if (paragraph.getTextContent().trim().equals("")) {
> + // We suspect this is an empty paragraph but it is possible that it contains other
> + // non-textual tags like images. For the moment we'll only search for internal image
> + // tags, we might have to refine this criterion later.
> + NodeList internalImages = paragraph.getElementsByTagName("img");
> + if (internalImages.getLength() == 0) {
> + paragraph.getParentNode().removeChild(paragraph);
> + i--;
> + }
> + }
> + }
> + }
> }