Hi Dhanesh,

Wow! Thanks for sharing Excalibur. The column separator feature meets a long-standing need!
Note for users: Set the "Flavor" dropdown on the right side to "Stream" to be able to use the column separator.

-Nikhil

On Saturday, June 8, 2019 at 2:16:02 AM UTC+5:30, Dhanesh B. Sabane wrote:
>
> Hello Karthik and Nikhil,
>
> On 05/06/19 8:52 pm, Nikhil VJ wrote:
> > Hi Karthik,
> >
> > Answering your second question: what's the best way to read the data?
> >
> > I recommend using Tabula - "for liberating data tables locked inside PDF files."
> > https://tabula.technology/
> >
> > With this you can go to a specific page, select the area of the table
> > with mouse to exclude unnecessary things, and extract the data to a CSV.
> >
> > There are two ways it uses to extract, so be sure to try the other if
> > one doesn't work out for you.
> >
> > And some manual work may be needed after extraction in case there's
> > extra spaces, line-breaks in the header, etc.
> >
> > Note: If the table you need is a scanned image in this PDF, then tabula
> > is not applicable. Only works with vector data.
> >
>
> While Tabula is a great tool to extract tabular data from PDFs (we rely
> on it for quite a few tasks in my company), sometimes it fails to
> correctly extract tabular data in a way that can be easily used by the
> user. Another tool that we use, in such cases, is "Camelot: PDF Table
> Extraction for Humans"
>
> https://camelot-py.readthedocs.io/
>
> I created a sample CSV document for the table on page 131 of the report
> that you linked. Please find it attached.
>
> Similar to Tabula, Camelot also has a web interface that you can use to
> select particular area of the table and get a CSV.
>
> https://www.tryexcalibur.com/
>
> I hope this helps! Feel free to reach me if you have any queries and
> would like some assistance along the way.
>
> Cheers! :)
>
> --
> Dhanesh B. Sabane
> https://dhanesh95.gitlab.io
> PGP ID: 0xB69A98C9C1642329
> Fingerprint: 9655 11F2 0D18 E76A 2396 D64D B69A 98C9 C164 2329
>

--
Datameet is a community of Data Science enthusiasts in India.
Know more about us by visiting http://datameet.org
