Hi Craig,

Thanks for your reply. I do now feel that we are in agreement :)

On one point I'm afraid I'll have to destroy your hopes, though.
Automatically converting from pptx to the canonical format will be pretty
close to impossible. Parsing the PowerPoint files would be possible, there
are libraries for that. Converting a pure text slide should also be
possible, unless there are five text boxes spread across the slide and
their placement is relevant. But as soon as you add shapes, connections
between shapes, transitions, or pretty much anything beyond text to the
picture, the effort would far outweigh the benefits, if it were even
possible.
Maybe some of our machine learning brethren can at some point use our
efforts to train a model for this, once we've converted a lot of slides :)
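
To illustrate the "pure text slide" case, here is a minimal sketch using
only the Python standard library. It assumes the usual pptx layout (a zip
archive with slide parts under ppt/slides/ and text runs stored in
DrawingML <a:t> elements); the function name slide_texts is made up for
this example. It ignores placement, shapes and transitions entirely, which
is exactly the part that makes real conversion hard.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs inside a slide live in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_texts(pptx):
    """Return {slide part name: [text runs]} for a .pptx path or file object.

    Assumption: slide parts are named ppt/slides/slideN.xml. Part names do
    not necessarily reflect the presentation order of the slides.
    """
    result = {}
    with zipfile.ZipFile(pptx) as archive:
        for name in sorted(archive.namelist()):
            if name.startswith("ppt/slides/slide") and name.endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                # Collect every text run, in document order, without layout.
                result[name] = [t.text or "" for t in root.iter(A_NS + "t")]
    return result
```

Calling slide_texts("deck.pptx") (a hypothetical file name) would give the
raw text per slide part, which could serve as a starting point for a manual
conversion, not as a replacement for it.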

Diffing is actually far easier, as PowerPoint has a function for that, but
it only gives you rough pointers as to what has changed. From a quick test
I just did, it at least tells you which slide changed. Maybe that's all we
need as input for the people converting things.
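
A rough "which slide changed" signal can also be approximated outside of
PowerPoint. The sketch below is not PowerPoint's compare function; it is a
swapped-in technique that hashes each slide part in two versions of a file
and reports the parts whose bytes differ. The names slide_hashes and
changed_slides are made up for this example, and it assumes slide parts are
named ppt/slides/slideN.xml. Re-saving a file may rewrite XML without any
visible change, and inserting a slide may renumber parts, so treat the
result as a pointer only.

```python
import hashlib
import zipfile


def slide_hashes(pptx):
    """Map each slide part name in a .pptx (path or file object) to a SHA-256."""
    with zipfile.ZipFile(pptx) as archive:
        return {
            name: hashlib.sha256(archive.read(name)).hexdigest()
            for name in archive.namelist()
            if name.startswith("ppt/slides/slide") and name.endswith(".xml")
        }


def changed_slides(old_pptx, new_pptx):
    """Return sorted slide part names that were added, removed or modified."""
    old, new = slide_hashes(old_pptx), slide_hashes(new_pptx)
    # A part missing on one side yields None there and counts as changed.
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```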

One sentence from your mail made me think a little. You said: "Apache
policy is 'no binaries in releases', which is commonly understood to mean
that releases contain the sources to build the release and not the result
of compiling". Would this, for us, mean that we cannot distribute HTML
slides as releases, but have to ship AsciiDoc material and the means to
create slides from it?

Best regards,
Sönke

On Fri, Mar 15, 2019 at 4:21 PM Craig Russell <apache....@gmail.com> wrote:

> Hi Sönke,
>
> > On Mar 15, 2019, at 1:02 AM, Sönke Liebau <soenke.lie...@opencore.com.INVALID> wrote:
> >
> >> If a work is directly editable, it is not binary. So I'd argue (with
> >> anyone who has a different opinion ;-) that files with .ppt, .odt, and
> >> other file types are source files because there are editors for them.
>
> > Happy to argue :)
> > While you are right in principle, have you ever tried to diff a pptx file
> > with git? At some point during the initial discussions for this project,
> > someone stated something along the lines of "while pptx is xml under the
> > hood this is so complex and un-diffable that they can, in essence, be
> > treated as binary files" - I just sort of took that as gospel when
> > writing this.
>
> So now we should probably separate policy and logistics.
>
> Apache policy is "no binaries in releases" which is commonly understood to
> mean that releases contain the sources to build the release, and not the
> result of compiling. In this sense, contributed pptx is not binary because
> it is not the result of processing anything. So there is no issue with
> having releases of these files if we choose to release them.
>
> But I agree with you about logistics. A contributed pptx will be processed
> into a canonical form and updates should then be done on the canonical
> form.
> >
> > Again, I am absolutely not fundamentally disagreeing with you, I agree
> > that material in pptx can be useful, should be part of the indexed
> > content, properly tagged and all that. But, I do think that the overall
> > target should be to convert everything that is donated into our
> > canonical (and still to be defined) format eventually and updates done
> > directly in there. During that conversion, I think there may be
> > stumbling stones if updates to pptx content come in.
>
> I was hoping that the conversion from pptx to canonical form is mostly
> done automatically. I have no experience here, just hope. But I have no
> hope that we can diff pptx files meaningfully.
>
> Craig
> >
> >>>
> >>> That is all manageable I'm sure, but we should put some thought into
> >>> it up front I think.
> >>>
> >>> Best regards,
> >>> Sönke
> >>>
> >
>
> Craig L Russell
> c...@apache.org
>
>

-- 
Sönke Liebau
Partner
Tel. +49 179 7940878
OpenCore GmbH & Co. KG - Thomas-Mann-Straße 8 - 22880 Wedel - Germany
