New submission from Hoang Duy Tran <hoangduytran1...@gmail.com>:
I have been working with some 'difficult' HTML files generated by Sphinx from RST sources. The following block of text is the original RST content:

----------------------------------------------------
Animation Playback Options
==========================

``-a`` ``<options>`` ``<file(s)>``
   Playback ``<file(s)>``, only operates this way when not running in background.

   ``-p`` ``<sx>`` ``<sy>``
      Open with lower left corner at ``<sx>``, ``<sy>``.
   ``-m``
      Read from disk (Do not buffer).
   ``-f`` ``<fps>`` ``<fps-base>``
      Specify FPS to start with.
   ``-j`` ``<frame>``
      Set frame step to ``<frame>``.
   ``-s`` ``<frame>``
      Play from ``<frame>``.
   ``-e`` ``<frame>``
      Play until ``<frame>``.
----------------------------------------------------

This is the HTML block that is generated by Sphinx:

----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options"><title>Animation Playback Options</title><definition_list><definition_list_item><term><literal>-a</literal> <literal><options></literal> <literal><file(s)></literal></term><definition><paragraph>Playback <literal><file(s)></literal>, only operates this way when not running in background.</paragraph><definition_list><definition_list_item><term><literal>-p</literal> <literal><sx></literal> <literal><sy></literal></term><definition><paragraph>Open with lower left corner at <literal><sx></literal>, <literal><sy></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-m</literal></term><definition><paragraph>Read from disk (Do not buffer).</paragraph></definition></definition_list_item><definition_list_item><term><literal>-f</literal> <literal><fps></literal> <literal><fps-base></literal></term><definition><paragraph>Specify FPS to start with.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-j</literal> <literal><frame></literal></term><definition><paragraph>Set frame step to 
<literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-s</literal> <literal><frame></literal></term><definition><paragraph>Play from <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-e</literal> <literal><frame></literal></term><definition><paragraph>Play until <literal><frame></literal>.</paragraph></definition></definition_list_item></definition_list></definition></definition_list_item></definition_list></section>
----------------------------------------------------

I then use BeautifulSoup, which uses HTMLParser, to beautify and parse the HTML document. I have noticed that every occurrence of text that starts with "<" and ends with ">", for example:

<options> <file(s)> ....

is misinterpreted by the HTMLParser library as a TAG, and a CLOSING TAG is then INVENTED for it, i.e.

<literal> <options> </options> </literal>

and

<literal> <file(s)> </file(s)> </literal>

When reversing the process, i.e. turning the HTML back into normal text, the original text is dropped, leading to TRUNCATION/LOSS of DATA.

Here is the beautified output. The problem lines are marked with '#**************************' to make them easier for you to identify.

----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options">
 <title>
  Animation Playback Options
 </title>
 <definition_list>
  <definition_list_item>
   <term>
    <literal>
     -a
    </literal>
    <literal>
     <options> #**************************
     </options> #**************************
    </literal>
    <literal>
     <file(s)> #**************************
     </file(s)> #**************************
    </literal>
   </term>
   <definition>
    <paragraph>
     Playback
     <literal>
      <file(s)> #**************************
      </file(s)> #**************************
     </literal>
     , only operates this way when not running in background.
    </paragraph>
    <definition_list>
     <definition_list_item>
      <term>
       <literal>
        -p
       </literal>
       <literal>
        <sx> #**************************
        </sx> #**************************
       </literal>
       <literal>
        <sy> #**************************
        </sy> #**************************
       </literal>
      </term>
      <definition>
       <paragraph>
        Open with lower left corner at
        <literal>
         <sx> #**************************
         </sx> #**************************
        </literal>
        ,
        <literal>
         <sy> #**************************
         </sy> #**************************
        </literal>
        .
       </paragraph>
      </definition>
     </definition_list_item>
     <definition_list_item>
      <term>
       <literal>
        -m
       </literal>
      </term>
      <definition>
       <paragraph>
        Read from disk (Do not buffer).
       </paragraph>
      </definition>
     </definition_list_item>
     <definition_list_item>
      <term>
       <literal>
        -f
       </literal>
       <literal>
        <fps> #**************************
        </fps> #**************************
       </literal>
       <literal>
        <fps-base> #**************************
        </fps-base> #**************************
       </literal>
      </term>
      <definition>
       <paragraph>
        Specify FPS to start with.
       </paragraph>
      </definition>
     </definition_list_item>
     <definition_list_item>
      <term>
       <literal>
        -j
       </literal>
       <literal>
        <frame/> #**************************
       </literal>
      </term>
      <definition>
       <paragraph>
        Set frame step to
        <literal>
         <frame/> #**************************
        </literal>
        .
       </paragraph>
      </definition>
     </definition_list_item>
     <definition_list_item>
      <term>
       <literal>
        -s
       </literal>
       <literal>
        <frame/> #**************************
       </literal>
      </term>
      <definition>
       <paragraph>
        Play from
        <literal>
         <frame/> #**************************
        </literal>
        .
       </paragraph>
      </definition>
     </definition_list_item>
     <definition_list_item>
      <term>
       <literal>
        -e
       </literal>
       <literal>
        <frame/> #**************************
       </literal>
      </term>
      <definition>
       <paragraph>
        Play until
        <literal>
         <frame/> #**************************
        </literal>
        .
       </paragraph>
      </definition>
     </definition_list_item>
    </definition_list>
   </definition>
  </definition_list_item>
 </definition_list>
</section>
----------------------------------------------------

I have attached the HTML file generated by Sphinx so that you can test this issue with the actual data. Here is the URL of the HTML file:

https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html

Kind Regards,
Hoang Tran

----------
components: Library (Lib)
files: arguments.html
messages: 343724
nosy: htran
priority: normal
severity: normal
status: open
title: HTMLParser mistakenly inventing new tags while parsing
type: behavior
versions: Python 3.6

Added file: https://bugs.python.org/file48367/arguments.html

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37071>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
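
For reference, the behaviour described in the report can be reproduced without BeautifulSoup. The following is a minimal sketch that assumes only the standard library's html.parser module; the TagLogger class and the parse_events helper are illustrative names and are not part of the report or of any library. It suggests that HTMLParser itself only reports <options> as a start tag, so the closing tag seen in the beautified output would be added by the tree builder on top of it. For comparison, the same text written with character references arrives as plain data.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Illustrative helper: records every parser callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to a fresh TagLogger and return the recorded events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Unescaped angle brackets: "<options>" is reported as a start tag,
    # and no data event is produced for it.
    print(parse_events("<literal><options></literal>"))

    # Character references: the same text is reported as data.
    print(parse_events("<literal>&lt;options&gt;</literal>"))
```

In the first call no ("data", ...) event is recorded for the text between the literal tags, which matches the loss of text described in the report.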