On Fri, 29 Mar 2019 at 01:54, Dave Fisher <dave2w...@comcast.net> wrote:
>
> Sent from my iPhone
>
> >> On Mar 28, 2019, at 5:51 PM, sebb <seb...@gmail.com> wrote:
> >>
> >>> On Thu, 28 Mar 2019 at 23:54, sebb <seb...@gmail.com> wrote:
> >>>
> >>> On Thu, 28 Mar 2019 at 23:42, sebb <seb...@gmail.com> wrote:
> >>>
> >>> On Thu, 28 Mar 2019 at 16:14, Bertrand Delacretaz
> >>> <bdelacre...@codeconsult.ch> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>>> On Thu, Mar 28, 2019 at 4:14 PM sebb <seb...@gmail.com> wrote:
> >>>>> ...OK to add the missing pages to the archive?...
> >>>>
> >>>> Sure - I did that quickly, too quickly apparently, thanks for fixing!
> >>>
> >>> There's another issue.
> >>>
> >>> wget by default downloads files with no extensions.
> >>>
> >>> However there are some directories which have the same names as files,
> >>> e.g.
> >>>
> >>> https://wiki.apache.org/incubator/OpenOfficeProposal
> >>> https://wiki.apache.org/incubator/OpenOfficeProposal/ja
> >>> and
> >>> https://wiki.apache.org/incubator/ProjectProposals
> >>> https://wiki.apache.org/incubator/ProjectProposals/TamayaProposal
> >>>
> >>> By the way there is also another TamayaProposal:
> >>>
> >>> https://wiki.apache.org/incubator/TamayaProposal
> >>>
> >>> Does the PMC want to fix these two clashes?
> >>> An alternative is to rename the files as .html.
> >>
> >> Turns out there are several more clashes which don't have obvious
> >> fixes, so I will rename as .html.
> >
> > OK, done, and the filenames that clashed have been added to SVN.
> >
> > Would it be useful to also save the raw text versions of the pages?
> > My Ruby script can easily download these.
>
> It looks like the content that we care about is in the div with id=“page”.
Yes, the Ruby script starts there when scanning TitleIndex for page links.
This includes all titles except FrontPage.

I've committed an example page to show the difference:

QpidProposal_body.html

It should be relatively easy to strip out the header and footer if
required, as the surrounding header and footer have a standard layout.
[No need to parse the page as HTML]

> >
> >> However it might be worth dropping the older TamayaProposal, i.e. the
> >> one under ProjectProposals
>
> Yes.

I'll leave that as an exercise for the reader(s).

> Regards,
> Dave
> >>
> >>>> -Bertrand
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>>> For additional commands, e-mail: general-h...@incubator.apache.org
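
For anyone who wants to try the header/footer stripping described above, here is
a minimal sketch in Ruby. It is not the script used for the archive. It assumes
the body starts at a `<div id="page"` marker and that the footer starts at a
`<div id="footer"` marker. The footer marker, the extract_body helper and the
sample page are all hypothetical, so check them against a real downloaded page
first. It uses plain string searches, with no HTML parsing.

```ruby
# Assumed markers; verify against an actual downloaded wiki page.
START_MARKER = '<div id="page"'
END_MARKER   = '<div id="footer"'

# Returns the text from start_marker up to (not including) end_marker.
# Returns nil if the start marker is absent; if the end marker is
# absent, returns everything from the start marker to the end.
def extract_body(html, start_marker: START_MARKER, end_marker: END_MARKER)
  start = html.index(start_marker)
  return nil if start.nil?

  finish = html.index(end_marker, start)
  finish.nil? ? html[start..-1] : html[start...finish]
end

# Made-up page layout, for illustration only.
sample = <<~HTML
  <html><body>
  <div id="header">site navigation</div>
  <div id="page" lang="en">
  <h1>QpidProposal</h1>
  </div>
  <div id="footer">page actions</div>
  </body></html>
HTML

puts extract_body(sample)
```

The result is a plain text slice, so it is only as well-formed as the layout
between the two markers. That should be fine if the header and footer really do
have a standard layout on every page.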