Hi,

Thanks for the explanation. I do not know if Grobid can extract from text
and HTML (please look at the documentation).

P.S.
You may also explore regex for the plain text and XPath for HTML as
alternatives if GROBID doesn't work.
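
To make that concrete, here is a rough sketch of both ideas using only the
Python standard library. Everything named here is my own assumption for
illustration, not anything from Grobid or Tika: the HEADINGS list, the
"heading alone on its own line" convention, and the helper names
split_text_sections and split_xhtml_sections. The XHTML half relies on
ElementTree's limited XPath subset, so it only works on well-formed XHTML;
messy real-world HTML would likely need a more forgiving parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heading list; adjust it to the headings your articles use.
HEADINGS = ["abstract", "introduction", "literature review", "methodology",
            "experimental setup", "discussion", "conclusions"]

# Assumption: a heading sits alone on its line, optionally numbered
# ("2. Methodology") and optionally followed by a colon.
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?("
    + "|".join(map(re.escape, HEADINGS))
    + r")[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_text_sections(text):
    """Return {heading: body} for plain text, cutting at heading lines."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.end():end].strip()
    return sections


XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def split_xhtml_sections(xhtml):
    """Return {heading: body} for well-formed XHTML.

    Each <p> is grouped under the nearest preceding heading tag.
    """
    root = ET.fromstring(xhtml)
    body = root.find("x:body", XHTML_NS)  # ElementTree XPath subset
    if body is None:
        body = root  # fall back if the document has no namespaced <body>
    sections, current = {}, None
    for el in body.iter():
        tag = el.tag.split("}")[-1]  # drop the "{namespace}" prefix
        text = " ".join("".join(el.itertext()).split())
        if tag in HEADING_TAGS:
            current = text.lower()
            sections[current] = []
        elif tag == "p" and current is not None:
            sections[current].append(text)
    return {name: "\n".join(paras) for name, paras in sections.items()}
```

For example, split_text_sections(open("article.txt").read()) would give a
dict keyed by lower-cased heading (article.txt is a placeholder file name).
Paragraphs that appear before the first recognised heading are dropped by
both helpers, which may or may not be what you want.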

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Mon, Jun 12, 2017 at 3:44 AM, [email protected] <[email protected]>
wrote:

> Dear Thamme,
>
> Yes, sometimes we are unable to find a PDF of a published research
> article. The published article, or part of it, is sometimes available
> as HTML. These articles are available to me in either TXT or HTML
> format. This is why my input files are TXT or HTML.
>
> Regards,
>
>
> On Sat, Jun 10, 2017 at 10:18 AM, Thamme Gowda <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have used the Grobid parser with PDF files only. I have no idea what you
>> are trying to extract from raw text or HTML.
>>
>> Since you said:
>> 1. "I am working with published research articles using Apache Tika."
>> 2. "My input files are in TXT and HTML formats",
>>
>> Are you saying your research articles are in .txt and .html files? And
>> are you trying to extract "sections like abstract, introduction,
>> literature review, ..." from these files?
>>
>> *--*
>> *Thamme Gowda*
>> TG | @thammegowda <https://twitter.com/thammegowda>
>> ~Sent via somebody's Webmail server!
>>
>> On Thu, Jun 8, 2017 at 7:44 AM, [email protected] <[email protected]>
>> wrote:
>>
>>> Dear Thamme,
>>>
>>>
>>> https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf
>>>
>>> The above presentation says that Grobid supports raw text. My input
>>> files are in TXT and HTML formats. Do you have any idea how these can be
>>> supported as raw text?
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>>
>>> On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> There is a nice project called Grobid [1] that does most of what you
>>>> are describing.
>>>> Tika has a Grobid parser built in (it calls Grobid over a REST API).
>>>> Check out [2] for details.
>>>>
>>>> I have a project that makes use of Tika with Grobid and NER support. It
>>>> also builds a search index using Solr.
>>>> Check out [3] for setup and [4] for parsing and indexing to Solr
>>>> if you would like to try out my Python project.
>>>> There I am able to extract the title, author names, affiliations, and the
>>>> whole text of articles.
>>>> I did not extract sections within the main body of research articles.
>>>> I assume there should be a way to configure it in Grobid.
>>>>
>>>> Alternatively, if Grobid can't detect sections, you can try the XHTML
>>>> content handler, which preserves the basic structure of the PDF file using
>>>> <p>, <br>, and heading tags. So technically it should be possible to write
>>>> a wrapper that breaks the XHTML output from Tika into sections.
>>>>
>>>> To get the XHTML output:
>>>>
>>>> # In bash, run `pip install tika` if tika isn't already installed
>>>> import tika
>>>> tika.initVM()
>>>> from tika import parser
>>>>
>>>>
>>>> file_path = "<pdf_dir>/2538.pdf"
>>>> data = parser.from_file(file_path, xmlContent=True)
>>>> print(data['content'])
>>>>
>>>>
>>>>
>>>>
>>>> Best,
>>>> Thamme
>>>>
>>>> [1] http://grobid.readthedocs.io/en/latest/Introduction/
>>>> [2] https://wiki.apache.org/tika/GrobidJournalParser
>>>> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
>>>> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>>>>
>>>> *--*
>>>> *Thamme Gowda*
>>>> TG | @thammegowda <https://twitter.com/thammegowda>
>>>> ~Sent via somebody's Webmail server!
>>>>
>>>> On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working with published research articles using Apache Tika. These
>>>>> articles have distinct sections like abstract, introduction, literature
>>>>> review, methodology, experimental setup, discussion and conclusions. Is
>>>>> there some way to extract document sections with Apache Tika?
>>>>>
>>>>> Regards,
>>>>>
>>>>
>>>>
>>>
>>
>
