
Vincent Massol Tue, 24 Feb 2009 10:52:23 -0800

On Feb 24, 2009, at 7:24 PM, Sergiu Dumitriu wrote:

> Vincent Massol wrote:
>> On Feb 24, 2009, at 4:48 PM, Sergiu Dumitriu wrote:
>>
>>> Asiri Rathnayake wrote:
>>>> Hi Vincent,
>>>>
>>>>> But the story
>>>>>> is different for OO generated html which puts a paragraph element
>>>>>> when there
>>>>>> shouldn't be one.
>>>>> I don't agree since it's very valid to have <p> inside cells and
>>>>> not a
>>>>> OO problem.
>>>>
>>>> It's very valid to have <p> elements inside table cells. But my
>>>> point is
>>>> this:
>>>>
>>>> The original word document when viewed through _oo writer_ displays
>>>> content
>>>> within table cells with a particular size. But when saved as html
>>>> and viewed
>>>> from a browser, the same table cell becomes enlarged. And this is
>>>> because
>>>> there is a paragraph element inside each table cell element
>>>> generated by oo
>>>> html generator.
>>>>
>>>> Now, since we wanted officeimporter to generate wiki content that
>>>> would
>>>> ultimately render an output which looks close to the original
>>>> document, I
>>>> decided to strip the paragraph element (to make it look smaller and
>>>> close to
>>>> the sizing of original document rendered in oo writer)
>>>>
>>>> But if it's only a matter of convention (wiki is wiki, office is
>>>> office) and
>>>> the paragraph should be left alone I can make that change easily.
>>>>
>>>> WDYT?
>>>>
>>> I for one prefer removing the paragraph. For me, this is clearly  
>>> an OO
>>> shortcoming. Vincent, the idea is not about paragraphs inside table
>>> cells in general, but about this particular paragraph that obviously
>>> shouldn't be there. The HTML generated by OO is just an  
>>> intermediary,
>>> we're not interested in keeping it as much as possible in the  
>>> wiki, we
>>> just want to extract the data from it and convert it to wiki syntax.
>>> The
>>> Office importer transforms office documents to wiki documents, and  
>>> not
>>> HTML to wiki. OO wrongly puts paragraphs in there, and the fact that
>>> the
>>> same HTML looks much different in a browser than the document  
>>> looks in
>>> OO is a good enough argument, IMO.
>>
>> This is generic and not specific to OO. HTML allows putting one or
>> several paragraphs in table cells, list items, etc., so we need to handle
>> those, independently of OO.
>> If we handle it at the rendering module level then it fixes both OO
>> and direct HTML input.
>
> No. We should not strip all the paragraphs that are found inside table
> cells.


I've never said this! What I told Asiri is that the XHTML parser  
should generate the following events:

beginCell + beginDocument + beginPara + onWord(sometext) + endPara +  
endDocument + endCell.

> Maybe the user wants those there.

I don't agree. We're making transformations and we're not leaving the
user's content untouched. For example, if the user enters "**hello" it'll
get converted to "**hello**". There are several cases where we're
transforming what the user enters.

Here I'm proposing that the XWiki Syntax Renderer transforms the  
events above into:

| sometext

instead of:

| (((sometext)))
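
To make the proposal concrete, here is a minimal sketch in Java. The
event kinds and the CellRenderSketch class are hypothetical
simplifications for illustration, not the actual XWiki Rendering
listener API. It assumes the rule is: when the embedded document in a
cell holds at most one paragraph, emit the text inline; otherwise keep
the ((( ))) group so that several paragraphs stay distinguishable.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: simplified stand-ins for the parser events, not the real XWiki API.
class CellRenderSketch {
    enum Kind { BEGIN_CELL, END_CELL, BEGIN_DOCUMENT, END_DOCUMENT, BEGIN_PARA, END_PARA, WORD }

    static final class Event {
        final Kind kind;
        final String text; // only set for WORD events

        Event(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    static Event ev(Kind kind) {
        return new Event(kind, null);
    }

    static Event word(String text) {
        return new Event(Kind.WORD, text);
    }

    /** Renders the events of a single table cell to wiki syntax. */
    static String renderCell(List<Event> events) {
        // Count paragraphs first: zero or one paragraph means the group can be dropped.
        int paragraphs = 0;
        for (Event e : events) {
            if (e.kind == Kind.BEGIN_PARA) {
                paragraphs++;
            }
        }
        boolean collapse = paragraphs <= 1;

        StringBuilder out = new StringBuilder("| ");
        boolean firstPara = true;
        boolean needSpace = false;
        for (Event e : events) {
            switch (e.kind) {
                case BEGIN_DOCUMENT:
                    if (!collapse) {
                        out.append("(((");
                    }
                    break;
                case END_DOCUMENT:
                    if (!collapse) {
                        out.append(")))");
                    }
                    break;
                case BEGIN_PARA:
                    // Separate paragraphs with a blank line, but not before the first one.
                    if (!firstPara) {
                        out.append("\n\n");
                    }
                    firstPara = false;
                    needSpace = false;
                    break;
                case WORD:
                    if (needSpace) {
                        out.append(' ');
                    }
                    out.append(e.text);
                    needSpace = true;
                    break;
                default:
                    break; // cell and end-of-paragraph events add no text in this sketch
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Event> cell = Arrays.asList(
            ev(Kind.BEGIN_CELL), ev(Kind.BEGIN_DOCUMENT), ev(Kind.BEGIN_PARA),
            word("sometext"),
            ev(Kind.END_PARA), ev(Kind.END_DOCUMENT), ev(Kind.END_CELL));
        System.out.println(renderCell(cell)); // prints "| sometext"
    }
}
```

With two paragraphs in the cell, the same sketch keeps the group, so
nothing is lost for HTML that really does carry several paragraphs.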

> But we know for sure that the
> _intermediary_ HTML generated by OO contains Ps where it shouldn't. It
> is specific. In general we should respect the markup, but in this
> specific case it is just a workaround for a third party bug. HTML
> generated by office suites is messy in general. I for one really hate
> the bulky sh1t that MS Word names HTML.

I still don't agree. See above.

Thanks
-Vincent

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs

Re: [xwiki-devs] [xwiki-notifications] r16999 - platform/core/trunk/xwiki-officeimporter/src/test/java/org/xwiki/officeimporter/internal/cleaner
