On 13/04/2016 14:43, NP wrote:
Is there any way to convert pre-existing HTML/XML/PDF files to ADOC?

TL;DR: HTML->ADOC: yes, via Pandoc. More generally, tools exist, but how much they help depends mostly on the amount of work, i.e. the number and length of the documents you have to convert.


   HTML? Probably yes


Yes for HTML, assuming it's a real, sane document and not a bag of JavaScript. :-)

Enter Pandoc, a many-to-many markup format converter that can produce AsciiDoc.

You can even perform conversion online, at least for a test: Example conversion <http://pandoc.org/try/?text=%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A&from=html&to=asciidoc>.

The reverse conversion is possible too, via DocBook: convert your AsciiDoc to DocBook, then Pandoc can turn it into any of the other supported formats.
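
For more than a handful of HTML files, the same HTML -> AsciiDoc conversion can be scripted around the pandoc command line. Below is a minimal sketch, assuming pandoc is installed and on your PATH; the function names and the mirrored directory layout are my own choices for illustration, not something Pandoc prescribes. Only the standard `-f`, `-t` and `-o` options are used.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(html_path, adoc_path):
    """Build the pandoc command line for one HTML -> AsciiDoc conversion."""
    return [
        "pandoc",
        "-f", "html",        # input format
        "-t", "asciidoc",    # output format
        "-o", str(adoc_path),
        str(html_path),
    ]


def convert_tree(src_dir, dst_dir, dry_run=False):
    """Convert every *.html file under src_dir, mirroring the layout in dst_dir.

    Returns the list of commands, so a dry run shows what would be executed.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    if not dry_run and shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    commands = []
    for html_path in sorted(src_dir.rglob("*.html")):
        relative = html_path.relative_to(src_dir)
        adoc_path = dst_dir / relative.with_suffix(".adoc")
        cmd = pandoc_command(html_path, adoc_path)
        commands.append(cmd)
        if not dry_run:
            adoc_path.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # raises if pandoc fails
    return commands
```

Running `convert_tree("site", "adoc", dry_run=True)` first (both directory names are placeholders) lists the commands without touching anything, which is a cheap way to check the file mapping before converting for real. Expect to review the generated AsciiDoc by hand afterwards.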


   XML? Probably yes with some work.


For XML, it depends on your intent.
Since the meaning of XML elements depends on the vocabulary in use, a turnkey solution might not exist. You may already have XSL stylesheets that turn your XML into HTML. If I were in that situation, I would try that route (XML to HTML, then HTML to AsciiDoc via Pandoc), or write a short problem-specific XSL stylesheet that turns the XML directly into AsciiDoc.
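
To give an idea of how small such a problem-specific transform can be, here is a sketch written in Python rather than XSL, using only the standard library. The XML vocabulary (`article`, `section`, `title`, `para`) is purely hypothetical; your own documents will use other element names, and anything the sketch does not recognise is silently skipped.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """Flatten an element's text (including nested inline tags), collapsing whitespace."""
    return " ".join("".join(elem.itertext()).split())


def section_to_adoc(section, level):
    """Turn one <article> or <section> element into a list of AsciiDoc lines."""
    lines = []
    for child in section:
        if child.tag == "title":
            # "= Title" for the document, "== Title" for a section, and so on.
            lines.append("=" * level + " " + element_text(child))
            lines.append("")
        elif child.tag == "para":
            lines.append(element_text(child))
            lines.append("")
        elif child.tag == "section":
            lines.extend(section_to_adoc(child, level + 1))
        # Any other element is ignored in this sketch.
    return lines


def xml_to_adoc(xml_source):
    """Convert an XML string in the hypothetical vocabulary to AsciiDoc text."""
    root = ET.fromstring(xml_source)
    return "\n".join(section_to_adoc(root, 1)).rstrip() + "\n"
```

Inline markup (bold, links, etc.) is flattened to plain text here; handling it means adding one more case per element, which is exactly the context-dependent part that no generic tool can guess for you.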


   PDF? Not that I know.

As far as I know, PDF is (mostly) an end-of-process, appearance-oriented format. PDF content usually does not keep structure such as chapter titles, sections, etc., although there might be some exceptions.
So, in general, there is no turnkey solution.
You could look for an existing PDF-to-HTML converter that tries to guess some markup from the layout, turn that HTML into AsciiDoc, then edit by hand.

Failing that, you can try to extract plain text from your PDF, either by copy-pasting into a text editor or with batch tools like pdftotext.

Even the order of the text inside a PDF is sometimes shuffled, so selecting text for copy-paste can yield surprises. For example, in a PDF with a two-column layout, selecting more than a fraction of a line may grab text from both columns.
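
The two-column surprise is easy to reproduce on paper. The sketch below uses made-up text fragments with page coordinates (hypothetical data, not the output of any particular PDF tool) and compares a naive line-by-line reading order with a column-aware one. The `split_x` parameter is an assumption too: you have to know, or guess, where the gutter between the columns is.

```python
# Hypothetical text fragments as (x, y, text); y grows downward on the page.
FRAGMENTS = [
    (50, 100, "Left column, line 1"),
    (320, 100, "Right column, line 1"),
    (50, 115, "Left column, line 2"),
    (320, 115, "Right column, line 2"),
]


def line_by_line(fragments):
    """Naive order: top to bottom, then left to right. This mixes the columns."""
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


def column_by_column(fragments, split_x):
    """Column-aware order: everything left of split_x first, then the rest."""
    ordered = sorted(fragments, key=lambda f: (f[0] >= split_x, f[1], f[0]))
    return [text for _x, _y, text in ordered]
```

With these fragments, `line_by_line(FRAGMENTS)` alternates between the left and right columns, while `column_by_column(FRAGMENTS, split_x=200)` reads the whole left column before the right one. Real PDFs are messier (headers, footnotes, figures spanning both columns), which is why hand editing is usually still needed.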




From http://pandoc.org/ :

Pandoc can convert documents in markdown <http://daringfireball.net/projects/markdown/>, reStructuredText <http://docutils.sourceforge.net/docs/ref/rst/introduction.html>, textile <http://redcloth.org/textile>, HTML <http://www.w3.org/TR/html40/>, DocBook <http://www.docbook.org/>, LaTeX <http://www.latex-project.org/>, MediaWiki markup <http://www.mediawiki.org/wiki/Help:Formatting>, TWiki markup <http://twiki.org/cgi-bin/view/TWiki/TextFormattingRules>, OPML <http://dev.opml.org/spec2.html>, Emacs Org-Mode <http://orgmode.org>, Txt2Tags <http://txt2tags.org/>, Microsoft Word docx <http://www.microsoft.com/interop/openup/openxml/default.aspx>, LibreOffice ODT <http://en.wikipedia.org/wiki/OpenDocument>, EPUB <http://en.wikipedia.org/wiki/EPUB>, or Haddock markup <http://www.haskell.org/haddock/doc/html/ch03s08.html> to

  * HTML formats: XHTML, HTML5, and HTML slide shows using Slidy
    <http://www.w3.org/Talks/Tools/Slidy>, reveal.js
    <http://lab.hakim.se/reveal-js/>, Slideous
    <http://goessner.net/articles/slideous/>, S5
    <http://meyerweb.com/eric/tools/s5/>, or DZSlides
    <http://paulrouget.com/dzslides/>.
  * Word processor formats: Microsoft Word docx
    <http://www.microsoft.com/interop/openup/openxml/default.aspx>,
    OpenOffice/LibreOffice ODT
    <http://en.wikipedia.org/wiki/OpenDocument>, OpenDocument XML
    <http://opendocument.xml.org/>
  * Ebooks: EPUB <http://en.wikipedia.org/wiki/EPUB> version 2 or 3,
    FictionBook2
    <http://www.fictionbook.org/index.php/Eng:XML_Schema_Fictionbook_2.1>
  * Documentation formats: DocBook <http://www.docbook.org/>, TEI
    Simple <https://github.com/TEIC/TEI-Simple>, GNU TexInfo
    <http://www.gnu.org/software/texinfo/>, Groff man
    <http://www.gnu.org/software/groff/groff.html> pages, Haddock
    markup <http://www.haskell.org/haddock/doc/html/ch03s08.html>
  * Page layout formats: InDesign ICML
    <https://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs55-docs/IDML/idml-specification.pdf>
  * Outline formats: OPML <http://dev.opml.org/spec2.html>
  * TeX formats: LaTeX <http://www.latex-project.org/>, ConTeXt
    <http://www.pragma-ade.nl/>, LaTeX Beamer slides
  * PDF <http://en.wikipedia.org/wiki/Portable_Document_Format> via LaTeX
  * Lightweight markup formats: Markdown
    <http://daringfireball.net/projects/markdown/> (including
    CommonMark <http://commonmark.org>), reStructuredText
    <http://docutils.sourceforge.net/docs/ref/rst/introduction.html>,
    AsciiDoc <http://www.methods.co.nz/asciidoc/>, MediaWiki markup
    <http://www.mediawiki.org/wiki/Help:Formatting>, DokuWiki markup
    <https://www.dokuwiki.org/wiki:syntax>, Emacs Org-Mode
    <http://orgmode.org>, Textile <http://redcloth.org/textile>
  * Custom formats: custom writers can be written in lua
    <http://www.lua.org>.




--
Stéphane Gourichon
