Hi Ranganath,

You're in luck. I'll explain why at the end.

First, open the document properties in your PDF viewer; the list of fonts used is in there somewhere. Apart from the regular ones, I see "Nudi Akshar-01" and "Nudi Akshar-06" listed. So I started searching with this string:
*kannada font "nudi akshar" to unicode convert*

Sharing a few potential leads for conversion:

- http://aravindavk.in/projects/ : It seems he has created a converter for another Kannada font and released it on GitHub. One can take it and change the mappings to make it work for Nudi Akshar. He has shared his email id on one of the pages.
- https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/Converting_from_non_Unicode_(Nudi,_Baraha,_...)_font_encoding_to_Unicode_Kannada : I found this page mentioned in this search result, which came up near the top of my search: https://bitbin.it/KV0Mn1x1/ ... talk about digital breadcrumbs. I suggest you post on the Talk page there to get in touch with others in the same situation.
- The Wikimedia page leads to this: https://www.karnataka.gov.in/kcit/pages/kannadasoftware.aspx

I'm not exploring further, so please check these out at your end.

If you're more interested in just having that content read than in converting it to Unicode, and you have some control over the places where it'll be read, then you can find and install the fonts mentioned, and share their .TTF files for installing elsewhere. However, this will not be possible on phones and tablets (as far as I know).

----------

For folks having a similar issue with Devanagari fonts (Hindi, Marathi etc.), check out this: https://sites.google.com/site/technicalhindi/home/converters. Brilliant work, but I wish someone would help them move to GitHub. I had to customize one of their converters as the text I was dealing with had slightly different mappings. It was a fun reverse engineering exercise.

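To give a rough idea of what "changing the mappings" involves, here is a minimal Python sketch of a table-driven converter. Please treat it as an illustration only: the table entries and the names `LEGACY_TO_UNICODE` and `convert` are made up by me and are not the real Nudi Akshar (or any other font's) encoding. It tries the longest legacy sequence first, because in these fonts one Unicode character is often typed as several legacy characters.

```python
# Hypothetical mapping table: these pairs are invented for illustration
# and are NOT the real Nudi Akshar encoding. Replace them with the
# mappings you work out for your own font.
LEGACY_TO_UNICODE = {
    "A": "\u0C85",         # KANNADA LETTER A
    "Aa": "\u0C86",        # KANNADA LETTER AA
    "k": "\u0C95",         # KANNADA LETTER KA
    "ki": "\u0C95\u0CBF",  # KANNADA LETTER KA + KANNADA VOWEL SIGN I
}


def convert(text, mapping=LEGACY_TO_UNICODE):
    """Replace legacy sequences with Unicode, trying the longest key first."""
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No mapping known: keep the character so gaps are easy to spot.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("Aaki k?"))
```

Characters with no entry in the table pass through unchanged, which makes it easy to see which mappings are still missing. A real converter may also need steps beyond plain substitution, for example moving vowel signs that the legacy font places before the consonant.
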
I've shared my customized converters here: http://ourpuneourbudget.in/tools/

----------

*Why you're in luck*

Non-English Unicode text and PDF technology have a weird problem that hasn't been resolved yet. PDF has to re-arrange the character glyphs to make them display properly, and that messes the text up. Display is achieved but fidelity is lost. So Unicode text that goes into a PDF may or may not make it back out in one piece. The degree of distortion even seems to vary across software and operating systems. An intervention at the PDF-creating end (hence not applicable to our case) is shared here: https://bugs.documentfoundation.org/show_bug.cgi?id=66597 (find Xetex)

Legacy ANSI fonts, on the other hand, retain full fidelity. You can convert a legacy-font doc (like yours) to PDF, copy out the text and get the original back.

So, since your text is in a legacy font (Nudi Akshar), you stand a chance of converting the whole thing into Unicode Kannada at the click of a button. Had it been in Unicode Kannada, you might have had to manually proof-read everything and make the necessary edits.

------------

For those trying to get Unicode text out of PDFs: hope you find a way, all the best. Check out that documentfoundation link above. See past discussions on this group: https://groups.google.com/forum/#!searchin/datameet/pdf$20unicode%7Csort:date

--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune, India
Website <http://nikhilvj.cu.cc>
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Contribute <https://www.instamojo.com/@nikhilvj/>

On Fri, Mar 2, 2018 at 10:19 AM, <rangan...@onlinerti.com> wrote:

> I have a kannada PDF file am trying to extract the data from the PDF. But
> it seems the font used is not in Unicode. I tried copy pasting the text
> from PDF still the character display properly.
>
> I have attached the PDF also.
>
> --
> Datameet is a community of Data Science enthusiasts in India.
> Know more about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.