Re: [datameet] Re: PLFS Data in easy-to-read format

Nikhil VJ Fri, 07 Jun 2019 22:03:58 -0700

Hi Dhanesh,

Wow! Thanks for sharing Excalibur. The column separator feature meets 
a long-standing need!


Note for users: set the "Flavor" dropdown on the right side to "Stream" to be 
able to use the column separator.
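
For anyone who prefers scripting over the web interface, the same setting can 
be sketched with the Camelot library that Excalibur is built on. This is a 
rough sketch, not a tested recipe for the PLFS report: the helper names 
columns_arg and extract_stream_table are made up for illustration, and the 
separator x-coordinates have to be read off your own PDF.

```python
from typing import List, Sequence


def columns_arg(x_positions: Sequence[float]) -> List[str]:
    """Format column separator x-coordinates for Camelot's `columns` option:
    one comma-separated string per table, wrapped in a list."""
    return [",".join(str(x) for x in sorted(x_positions))]


def extract_stream_table(pdf_path: str, page: int, x_positions: Sequence[float]):
    """Read one page with the Stream flavor and explicit column separators."""
    # Imported lazily so the helper above works without Camelot installed.
    import camelot  # pip install "camelot-py[cv]"

    return camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",                   # same as the "Stream" dropdown choice
        columns=columns_arg(x_positions),  # PDF x-coordinates of the separators
        split_text=True,                   # split text that straddles a separator
    )
```

Usage would look something like 
extract_stream_table("report.pdf", 131, [120, 210, 300])[0].to_csv("page_131.csv"), 
where the file name and coordinates are placeholders.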

-Nikhil

On Saturday, June 8, 2019 at 2:16:02 AM UTC+5:30, Dhanesh B. Sabane wrote:
>
> Hello Karthik and Nikhil, 
>
> On 05/06/19 8:52 pm, Nikhil VJ wrote: 
> > Hi Karthik, 
> > 
> > Answering your second question: what's the best way to read the data?  
> > 
> > I recommend using Tabula - "for liberating data tables locked inside PDF 
> > files." 
> > https://tabula.technology/ 
> > 
> > With this you can go to a specific page, select the area of the table 
> > with the mouse to exclude unnecessary things, and extract the data to a 
> > CSV.
> > 
> > It has two extraction methods, so be sure to try the other if 
> > one doesn't work out for you.
> > 
> > Some manual work may be needed after extraction in case there are 
> > extra spaces, line breaks in the header, etc.
> > 
> > Note: If the table you need is a scanned image in this PDF, then Tabula 
> > is not applicable. It only works with text-based (vector) PDFs.
> > 
>
> While Tabula is a great tool to extract tabular data from PDFs (we rely 
> on it for quite a few tasks in my company), sometimes it fails to 
> correctly extract tabular data in a way that can be easily used by the 
> user. Another tool that we use, in such cases, is "Camelot: PDF Table 
> Extraction for Humans" 
>
> https://camelot-py.readthedocs.io/ 
>
> I created a sample CSV document for the table on page 131 of the report 
> that you linked. Please find it attached. 
>
> Similar to Tabula, Camelot also has a web interface that you can use to 
> select a particular area of the table and get a CSV.
>
> https://www.tryexcalibur.com/ 
>
> I hope this helps! Feel free to reach out to me if you have any queries or 
> would like some assistance along the way.
>
> Cheers! :) 
>
> -- 
> Dhanesh B. Sabane 
> https://dhanesh95.gitlab.io 
> PGP ID: 0xB69A98C9C1642329 
> Fingerprint: 9655 11F2 0D18 E76A 2396 D64D B69A 98C9 C164 2329 
>

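The manual cleanup mentioned in the quoted message (extra spaces, line breaks 
in the header) can be partly automated. Below is a minimal sketch using only 
the Python standard library; clean_cell and clean_csv are illustrative names, 
and it assumes the extracted CSV fits in memory as a string.

```python
import csv
import io
import re


def clean_cell(value: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", value).strip()


def clean_csv(text: str) -> str:
    """Return CSV text with every cell normalized by clean_cell."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([clean_cell(cell) for cell in row])
    return out.getvalue()
```

This only normalizes whitespace inside cells; merged or misaligned columns 
still need a manual look.
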
