On Fri, 29 Mar 2019 at 01:54, Dave Fisher <dave2w...@comcast.net> wrote:
>
> >> On Mar 28, 2019, at 5:51 PM, sebb <seb...@gmail.com> wrote:
> >>
> >>> On Thu, 28 Mar 2019 at 23:54, sebb <seb...@gmail.com> wrote:
> >>>
> >>> On Thu, 28 Mar 2019 at 23:42, sebb <seb...@gmail.com> wrote:
> >>>
> >>> On Thu, 28 Mar 2019 at 16:14, Bertrand Delacretaz
> >>> <bdelacre...@codeconsult.ch> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>>> On Thu, Mar 28, 2019 at 4:14 PM sebb <seb...@gmail.com> wrote:
> >>>>> ...OK to add the missing pages to the archive?...
> >>>>
> >>>> Sure - I did that quickly, too quickly apparently, thanks for fixing!
> >>>
> >>> There's another issue.
> >>>
> >>> By default, wget saves these pages under names with no file extension.
> >>>
> >>> However, some pages have the same name as a directory of sub-pages,
> >>> so the file and the directory clash on disk, e.g.
> >>>
> >>> https://wiki.apache.org/incubator/OpenOfficeProposal
> >>> https://wiki.apache.org/incubator/OpenOfficeProposal/ja
> >>> and
> >>> https://wiki.apache.org/incubator/ProjectProposals
> >>> https://wiki.apache.org/incubator/ProjectProposals/TamayaProposal
> >>>
> >>> By the way there is also another TamayaProposal:
> >>>
> >>> https://wiki.apache.org/incubator/TamayaProposal
> >>>
> >>> Does the PMC want to fix these two clashes?
> >>> An alternative is to rename the files as .html.
> >>
> >> Turns out there are several more clashes which don't have obvious
> >> fixes, so I will rename as .html.
> >
> > OK, done, and the filenames that clashed have been added to SVN.
> >
> > Would it be useful to also save the raw text versions of the pages?
> > My Ruby script can easily download these.
>
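
For anyone wanting to check for such clashes themselves, here is a
minimal sketch in Python (not the Ruby script mentioned above). It
assumes the page names are already available as a list of strings; the
function names find_clashes and local_filename are hypothetical.

```python
def find_clashes(names):
    """Return names that are both a page and the parent directory of another page."""
    parents = set()
    for name in names:
        parts = name.split("/")
        # Every proper prefix of a path is a directory on disk.
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))
    return sorted(n for n in names if n in parents)


def local_filename(name):
    """Append .html so a page file never shares a path with a directory."""
    return name + ".html"


names = [
    "OpenOfficeProposal",
    "OpenOfficeProposal/ja",
    "ProjectProposals",
    "ProjectProposals/TamayaProposal",
    "TamayaProposal",
]
print(find_clashes(names))
print([local_filename(n) for n in find_clashes(names)])
```

With the .html suffix, OpenOfficeProposal.html (a file) and
OpenOfficeProposal/ (a directory) can coexist in the same tree.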
> It looks like the content that we care about is in the div with id="page".

Yes, the Ruby script starts there when scanning TitleIndex for page links.
This includes all titles except FrontPage.
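
As a rough illustration of that scan, here is a sketch in Python rather
than the actual Ruby script. It assumes page links appear as
href="/incubator/<name>" after the <div id="page" marker; the function
name page_titles and the sample markup are made up for the example.

```python
import re


def page_titles(title_index_html, prefix="/incubator/"):
    """Return page names linked after the <div id="page" marker, in order."""
    start = title_index_html.find('<div id="page"')
    if start < 0:
        return []
    body = title_index_html[start:]
    # Assumed link shape: href="/incubator/Name" (no query string or fragment).
    pattern = re.compile(r'href="' + re.escape(prefix) + r'([^"?#]+)"')
    seen = []
    for name in pattern.findall(body):
        if name not in seen:
            seen.append(name)
    return seen


sample = (
    '<div id="header"><a href="/incubator/FrontPage">FrontPage</a></div>'
    '<div id="page">'
    '<a href="/incubator/OpenOfficeProposal">OpenOfficeProposal</a>'
    '<a href="/incubator/OpenOfficeProposal/ja">OpenOfficeProposal/ja</a>'
    '</div>'
)
print(page_titles(sample))
```

Links that occur before the marker are skipped because the search only
starts at the div.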

I've committed an example page to show the difference:

QpidProposal_body.html

It should be relatively easy to strip out the header and footer if
required, as both have a standard layout.
[No need to parse the page as HTML]
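
A sketch of that kind of stripping, using plain string searches rather
than an HTML parser. It is written in Python and is only an assumption
about how it could be done: the start marker follows the div id
mentioned earlier in the thread, while the end marker
'<div id="footer"' and the name extract_body are hypothetical.

```python
def extract_body(html, start_marker='<div id="page"', end_marker='<div id="footer"'):
    """Return the text from start_marker up to (not including) end_marker."""
    start = html.find(start_marker)
    if start < 0:
        return None  # layout not recognised; leave the page alone
    end = html.find(end_marker, start)
    if end < 0:
        end = len(html)  # no footer found: keep everything after the start
    return html[start:end]


page = (
    '<html><div id="header">nav</div>'
    '<div id="page"><p>Qpid</p></div>'
    '<div id="footer">footer</div></html>'
)
print(extract_body(page))
```

Because the markers are fixed strings, this only works while the
surrounding layout stays the same on every page.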

> >
> >> However it might be worth dropping the older TamayaProposal, i.e. the
> >> one under ProjectProposals
>
> Yes.

I'll leave that as an exercise for the reader(s).

> Regards,
> Dave
> >>
> >>>> -Bertrand
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>>> For additional commands, e-mail: general-h...@incubator.apache.org