Re: Converting Pre-existing HTML/XML/PDF files to ADOC

Stéphane Gourichon Wed, 13 Apr 2016 07:39:13 -0700

On 13/04/2016 14:43, NP wrote:

Is there any way to convert pre-existing HTML/XML/PDF files to ADOC?

TL;DR: HTML->ADOC yes, via Pandoc. Overall, there are tools, but the added value of tools for that job mostly depends on the amount of work: number * length of the documents you have to convert.



   HTML? Probably yes

Yes for HTML, assuming it's a real, sane document and not a bag of JavaScript. :-)

Enter Pandoc, a many-to-many markup format converter that can produce AsciiDoc.

You can even perform the conversion online, at least for a test: Example conversion <http://pandoc.org/try/?text=%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Ch1%3EYay%2C+title%21%3C%2Fh1%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A%3Cp%3EText%3C%2Fp%3E%0A&from=html&to=asciidoc>.
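
The same conversion can be run locally. Here is a minimal sketch in Python, assuming pandoc is installed and on your PATH. The function names and file names are made up for illustration; only the pandoc options -f, -t and -o are pandoc's own.

```python
import subprocess


def html_to_adoc_cmd(html_path, adoc_path):
    """Build the pandoc command line that converts an HTML file to AsciiDoc."""
    return ["pandoc", "-f", "html", "-t", "asciidoc", "-o", adoc_path, html_path]


def html_to_adoc(html_path, adoc_path):
    """Run pandoc; assumes the pandoc executable is available on the PATH."""
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(html_to_adoc_cmd(html_path, adoc_path), check=True)
```

Calling html_to_adoc("page.html", "page.adoc") would then write page.adoc next to the (hypothetical) input file. Expect to review the result by hand, especially for HTML that relies on layout tables or scripts.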

It can even do the reverse conversion via DocBook: convert your AsciiDoc to DocBook, then Pandoc can turn it into any of the other supported formats.
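
A rough sketch of that two-step route, assuming Asciidoctor is used for the AsciiDoc-to-DocBook step (the classic asciidoc tool has a similar "-b docbook" backend) and that both asciidoctor and pandoc are on the PATH. Function names, file names and the default target format are illustrative only.

```python
import subprocess


def adoc_to_docbook_cmd(adoc_path, xml_path):
    """Step 1: AsciiDoc -> DocBook, here with Asciidoctor's docbook5 backend."""
    return ["asciidoctor", "-b", "docbook5", "-o", xml_path, adoc_path]


def docbook_to_target_cmd(xml_path, out_path, target):
    """Step 2: DocBook -> any pandoc output format (e.g. markdown, html, docx)."""
    return ["pandoc", "-f", "docbook", "-t", target, "-o", out_path, xml_path]


def convert_via_docbook(adoc_path, out_path, target="markdown"):
    """Chain both steps, using an intermediate DocBook file next to the source."""
    xml_path = adoc_path + ".xml"  # hypothetical naming for the intermediate file
    for cmd in (adoc_to_docbook_cmd(adoc_path, xml_path),
                docbook_to_target_cmd(xml_path, out_path, target)):
        # check=True stops the chain as soon as one tool reports an error.
        subprocess.run(cmd, check=True)
```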



   XML? Probably yes with some work.


For XML it probably depends on your intent.

Given that XML semantics are context-dependent, a turnkey solution might not exist.

Depending on your situation, you may already have some XSL stylesheets that turn the XML into HTML. If this happened to me I would try that, or write a short problem-specific XSL that would turn the XML directly into AsciiDoc.
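
To give an idea of how small such a problem-specific converter can be, here is a sketch. It is not XSL: it does the same kind of tree walk in Python with the standard library. The XML vocabulary (doc, section, para, and their title attributes) is invented for the example; a real converter would have to follow your own schema.

```python
import xml.etree.ElementTree as ET


def xml_to_adoc(xml_text):
    """Turn a made-up <doc>/<section>/<para> vocabulary into AsciiDoc text."""
    root = ET.fromstring(xml_text)
    lines = ["= " + root.get("title", "Untitled"), ""]
    # Only direct child sections are handled; nesting is left out of this sketch.
    for section in root.findall("section"):
        lines += ["== " + section.get("title", ""), ""]
        for para in section.findall("para"):
            # Gather all text inside the paragraph and collapse whitespace runs.
            text = " ".join("".join(para.itertext()).split())
            lines += [text, ""]
    return "\n".join(lines)
```

Each mapping rule here (document title to "= ", section title to "== ", paragraph to a text block followed by a blank line) would be one small template in an equivalent XSL stylesheet.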



   PDF? Not that I know.

As far as I know, PDF is (mostly) an end-of-process, appearance-oriented format. PDF content does not keep structure like chapter titles, sections, etc. There might be some exceptions.

So, in general there's no solution.

Perhaps find an existing PDF-to-HTML converter that tries to guess some markup from the PDF, turn that HTML into AsciiDoc, then edit by hand.

Failing that, you can try to transform your PDF to text by copy-paste into a text editor, or with batch tools like pdftoascii.

Even the ordering of text in a PDF is sometimes shuffled. As a result, trying to select some text for copy-paste sometimes yields surprises. For example, in a two-column layout PDF, the selection takes text from both columns as soon as you select more than a fraction of a line.
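
A toy illustration of that two-column effect, with made-up data and no real PDF involved: the same page read in column order versus read straight across the page width, which is roughly what a naive selection or extraction does.

```python
def read_by_columns(left, right):
    """Intended reading order: the whole left column, then the right one."""
    return left + right


def read_across_page(left, right):
    """Naive order: each physical line spans both columns."""
    return [l + " " + r for l, r in zip(left, right)]


# Hypothetical page content: two columns of two lines each.
left_column = ["Chapter one starts", "here and goes on."]
right_column = ["Chapter two is", "unrelated text."]
```

read_by_columns keeps each chapter's sentences together, while read_across_page glues "Chapter one starts" to "Chapter two is" on the same line. Text recovered that way needs reordering by hand before it is usable as AsciiDoc.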





From http://pandoc.org/ :

Pandoc can convert documents in markdown <http://daringfireball.net/projects/markdown/>, reStructuredText <http://docutils.sourceforge.net/docs/ref/rst/introduction.html>, textile <http://redcloth.org/textile>, HTML <http://www.w3.org/TR/html40/>, DocBook <http://www.docbook.org/>, LaTeX <http://www.latex-project.org/>, MediaWiki markup <http://www.mediawiki.org/wiki/Help:Formatting>, TWiki markup <http://twiki.org/cgi-bin/view/TWiki/TextFormattingRules>, OPML <http://dev.opml.org/spec2.html>, Emacs Org-Mode <http://orgmode.org>, Txt2Tags <http://txt2tags.org/>, Microsoft Word docx <http://www.microsoft.com/interop/openup/openxml/default.aspx>, LibreOffice ODT <http://en.wikipedia.org/wiki/OpenDocument>, EPUB <http://en.wikipedia.org/wiki/EPUB>, or Haddock markup <http://www.haskell.org/haddock/doc/html/ch03s08.html> to


  * HTML formats: XHTML, HTML5, and HTML slide shows using Slidy
    <http://www.w3.org/Talks/Tools/Slidy>, reveal.js
    <http://lab.hakim.se/reveal-js/>, Slideous
    <http://goessner.net/articles/slideous/>, S5
    <http://meyerweb.com/eric/tools/s5/>, or DZSlides
    <http://paulrouget.com/dzslides/>.
  * Word processor formats: Microsoft Word docx
    <http://www.microsoft.com/interop/openup/openxml/default.aspx>,
    OpenOffice/LibreOffice ODT
    <http://en.wikipedia.org/wiki/OpenDocument>, OpenDocument XML
    <http://opendocument.xml.org/>
  * Ebooks: EPUB <http://en.wikipedia.org/wiki/EPUB> version 2 or 3,
    FictionBook2
    <http://www.fictionbook.org/index.php/Eng:XML_Schema_Fictionbook_2.1>
  * Documentation formats: DocBook <http://www.docbook.org/>, TEI
    Simple <https://github.com/TEIC/TEI-Simple>, GNU TexInfo
    <http://www.gnu.org/software/texinfo/>, Groff man
    <http://www.gnu.org/software/groff/groff.html> pages, Haddock
    markup <http://www.haskell.org/haddock/doc/html/ch03s08.html>
  * Page layout formats: InDesign ICML
    <https://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs55-docs/IDML/idml-specification.pdf>
  * Outline formats: OPML <http://dev.opml.org/spec2.html>
  * TeX formats: LaTeX <http://www.latex-project.org/>, ConTeXt
    <http://www.pragma-ade.nl/>, LaTeX Beamer slides
  * PDF <http://en.wikipedia.org/wiki/Portable_Document_Format> via LaTeX
  * Lightweight markup formats: Markdown
    <http://daringfireball.net/projects/markdown/> (including
    CommonMark <http://commonmark.org>), reStructuredText
    <http://docutils.sourceforge.net/docs/ref/rst/introduction.html>,
    AsciiDoc <http://www.methods.co.nz/asciidoc/>, MediaWiki markup
    <http://www.mediawiki.org/wiki/Help:Formatting>, DokuWiki markup
    <https://www.dokuwiki.org/wiki:syntax>, Emacs Org-Mode
    <http://orgmode.org>, Textile <http://redcloth.org/textile>
  * Custom formats: custom writers can be written in lua
    <http://www.lua.org>.




--
Stéphane Gourichon

