On 13/04/2016 at 14:43, NP wrote:
Is there any way to convert pre-existing HTML/XML/PDF files to ADOC?
TL;DR: HTML->ADOC yes, via Pandoc. Overall, tools exist, but how much
value they add depends mostly on the amount of work: the number and
length of the documents you have to convert.
HTML? Probably yes
Yes for HTML, assuming it's a real, sane document and not a bag of
JavaScript. :-)
Enter Pandoc, a many-to-many markup format converter that can produce
AsciiDoc.
You can even run the conversion online, at least as a test: example
conversion
<http://pandoc.org/try/?text=%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A&from=html&to=asciidoc>.
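For more than a handful of files, the same conversion can be scripted. Here is a minimal sketch, assuming pandoc is installed and on the PATH. The function names and the choice to write each .adoc file next to its .html source are my own, for illustration only.

```python
import subprocess
from pathlib import Path


def pandoc_html_to_adoc_cmd(src, dst):
    """Build the pandoc command line that converts one HTML file to AsciiDoc."""
    return ["pandoc", "--from=html", "--to=asciidoc",
            "--output", str(dst), str(src)]


def convert_tree(root, dry_run=False):
    """Convert every *.html file under root, writing a .adoc file next to it.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = []
    for src in sorted(Path(root).rglob("*.html")):
        cmd = pandoc_html_to_adoc_cmd(src, src.with_suffix(".adoc"))
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if pandoc reports an error
    return commands
```

Running convert_tree("docs", dry_run=True) first lets you check which files would be touched before anything is written.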
It can even do the reverse conversion via DocBook: convert your AsciiDoc
to DocBook, then Pandoc can turn it into any of its other supported
output formats.
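The reverse route could look roughly like this. It is a sketch, assuming both the asciidoc and pandoc commands are installed. The helper names and the intermediate file name are hypothetical.

```python
import subprocess


def adoc_to_docbook_cmd(adoc, docbook_xml):
    """Step 1: asciidoc renders the source with its DocBook backend."""
    return ["asciidoc", "-b", "docbook", "-o", docbook_xml, adoc]


def docbook_to_target_cmd(docbook_xml, target_format, output):
    """Step 2: pandoc reads the DocBook and writes the requested format."""
    return ["pandoc", "--from=docbook", "--to=" + target_format,
            "--output", output, docbook_xml]


def adoc_to_any(adoc, target_format, output, docbook_xml="intermediate.xml"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in (adoc_to_docbook_cmd(adoc, docbook_xml),
                docbook_to_target_cmd(docbook_xml, target_format, output)):
        subprocess.run(cmd, check=True)
```

For example, adoc_to_any("manual.adoc", "markdown", "manual.md") would go AsciiDoc -> DocBook -> Markdown.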
XML? Probably yes with some work.
For XML it probably depends on your intent.
Given that XML semantics are context dependent, a turnkey solution might
not exist.
Depending on your situation, you may have some XSL stylesheets that turn
the XML into HTML.
If I were in that situation, I would try that first, or write a short
problem-specific XSL stylesheet that turns the XML directly into AsciiDoc.
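To give an idea of how small such a problem-specific transformation can be, here is a sketch that uses Python's standard xml.etree module instead of XSL (a deliberate substitution, only to keep the example self-contained). The XML vocabulary (doc, section, para, and a title attribute) is invented for the example. Your own XML will need its own mapping.

```python
import xml.etree.ElementTree as ET


def xml_to_asciidoc(xml_text):
    """Map a hypothetical <doc>/<section>/<para> vocabulary to AsciiDoc."""
    root = ET.fromstring(xml_text)
    lines = ["= " + root.get("title", "Untitled"), ""]
    for section in root.iter("section"):
        lines += ["== " + section.get("title", ""), ""]
        for para in section.findall("para"):
            # Flatten inline children and collapse runs of whitespace.
            text = " ".join("".join(para.itertext()).split())
            lines += [text, ""]
    return "\n".join(lines).rstrip() + "\n"
```

Every section is emitted as a level-1 section, so nested sections would need extra handling.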
PDF? Not that I know of.
As far as I know, PDF is (mostly) an end-of-process, appearance-oriented
format.
PDF content does not keep structure like chapter titles, sections, etc.
There might be some exceptions.
So, in general there's no solution.
Perhaps find an existing PDF-to-HTML converter that tries to guess some
markup from that, and turn that HTML to asciidoc, then edit by hand.
Failing that, you can try to extract the text from your PDF by
copy-pasting into a text editor, or with batch tools like pdftotext or
ps2ascii.
Even the ordering of text in a PDF is sometimes shuffled. As a result,
selecting text for copy-paste sometimes yields surprises. For example,
in a PDF with a two-column layout, a selection longer than a fraction of
a line can pick up text from both columns.
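For the batch route, a sketch assuming the pdftotext command (from Poppler or Xpdf) is installed. The wrapper and its keep_layout parameter are hypothetical names. Whether the -layout option helps or hurts with multi-column pages is something to check on your own documents.

```python
import subprocess


def pdftotext_cmd(pdf, txt, keep_layout=False):
    """Build a pdftotext command line; -layout asks it to mimic the page layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf, txt]


def pdf_to_text(pdf, txt, keep_layout=False):
    """Extract the text of one PDF into a plain-text file."""
    subprocess.run(pdftotext_cmd(pdf, txt, keep_layout), check=True)
```

The resulting text still has to be marked up by hand, since nothing here recovers titles or sections.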
From http://pandoc.org/ :
Pandoc can convert documents in markdown
<http://daringfireball.net/projects/markdown/>, reStructuredText
<http://docutils.sourceforge.net/docs/ref/rst/introduction.html>,
textile <http://redcloth.org/textile>, HTML
<http://www.w3.org/TR/html40/>, DocBook <http://www.docbook.org/>,
LaTeX <http://www.latex-project.org/>, MediaWiki markup
<http://www.mediawiki.org/wiki/Help:Formatting>, TWiki markup
<http://twiki.org/cgi-bin/view/TWiki/TextFormattingRules>, OPML
<http://dev.opml.org/spec2.html>, Emacs Org-Mode <http://orgmode.org>,
Txt2Tags <http://txt2tags.org/>, Microsoft Word docx
<http://www.microsoft.com/interop/openup/openxml/default.aspx>,
LibreOffice ODT <http://en.wikipedia.org/wiki/OpenDocument>, EPUB
<http://en.wikipedia.org/wiki/EPUB>, or Haddock markup
<http://www.haskell.org/haddock/doc/html/ch03s08.html> to
* HTML formats: XHTML, HTML5, and HTML slide shows using Slidy
<http://www.w3.org/Talks/Tools/Slidy>, reveal.js
<http://lab.hakim.se/reveal-js/>, Slideous
<http://goessner.net/articles/slideous/>, S5
<http://meyerweb.com/eric/tools/s5/>, or DZSlides
<http://paulrouget.com/dzslides/>.
* Word processor formats: Microsoft Word docx
<http://www.microsoft.com/interop/openup/openxml/default.aspx>,
OpenOffice/LibreOffice ODT
<http://en.wikipedia.org/wiki/OpenDocument>, OpenDocument XML
<http://opendocument.xml.org/>
* Ebooks: EPUB <http://en.wikipedia.org/wiki/EPUB> version 2 or 3,
FictionBook2
<http://www.fictionbook.org/index.php/Eng:XML_Schema_Fictionbook_2.1>
* Documentation formats: DocBook <http://www.docbook.org/>, TEI
Simple <https://github.com/TEIC/TEI-Simple>, GNU TexInfo
<http://www.gnu.org/software/texinfo/>, Groff man
<http://www.gnu.org/software/groff/groff.html> pages, Haddock
markup <http://www.haskell.org/haddock/doc/html/ch03s08.html>
* Page layout formats: InDesign ICML
<https://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs55-docs/IDML/idml-specification.pdf>
* Outline formats: OPML <http://dev.opml.org/spec2.html>
* TeX formats: LaTeX <http://www.latex-project.org/>, ConTeXt
<http://www.pragma-ade.nl/>, LaTeX Beamer slides
* PDF <http://en.wikipedia.org/wiki/Portable_Document_Format> via LaTeX
* Lightweight markup formats: Markdown
<http://daringfireball.net/projects/markdown/> (including
CommonMark <http://commonmark.org>), reStructuredText
<http://docutils.sourceforge.net/docs/ref/rst/introduction.html>,
AsciiDoc <http://www.methods.co.nz/asciidoc/>, MediaWiki markup
<http://www.mediawiki.org/wiki/Help:Formatting>, DokuWiki markup
<https://www.dokuwiki.org/wiki:syntax>, Emacs Org-Mode
<http://orgmode.org>, Textile <http://redcloth.org/textile>
* Custom formats: custom writers can be written in lua
<http://www.lua.org>.
--
Stéphane Gourichon