All English characters and some Chinese words are separated by a space
----------------------------------------------------------------------
Key: PDFBOX-779
URL: https://issues.apache.org/jira/browse/PDFBOX-779
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.1, 1.3.0
Environment: java 1.6.0_20
pdfbox 1.2.1
fontbox 1.2.1
Reporter: Jingxuan Yu
See the PDF document and the text file extracted from it by ExtractText.
:( I could not upload attachments.
So, here is the file's info:
$ pdfinfo IKAnalyzer.pdf
Title: IKAnalyzer中文分词器V3.0使用手册
Keywords: IK Analyzer 中文分词器 Lucene
Author: 林良益、卓诗垚
Creator: WPS Office 个人版
Producer: PDFlib 7.0.3 (C++/Win32)
CreationDate: Sun Dec 6 22:07:26 2009
Tagged: no
Pages: 15
Encrypted: no
Page size: 595.3 x 841.9 pts (A4)
File size: 441273 bytes
Optimized: no
PDF version: 1.5
$ pdffonts IKAnalyzer.pdf
name                                     type              emb sub uni object ID
---------------------------------------- ----------------- --- --- --- ---------
INUZMH+NSimSun-Identity-H                CID TrueType      yes yes yes      7  0
MGIXAY+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes      8  0
CFLOPA+SimSun-Identity-H                 CID TrueType      yes yes yes      6  0
GHNZKZ+TimesNewRomanPS-BoldMT-Identity-H CID TrueType      yes yes yes     19  0
UNEBHT+Cambria-Bold-Identity-H           CID TrueType      yes yes yes     20  0
UQKWWP+Wingdings-Regular-Identity-H      CID TrueType      yes yes yes     33  0
NKFTTO+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes     40  0
OOJXDG+CourierNewPSMT-Identity-H         CID TrueType      yes yes yes     51  0
WHLDYI+CourierNewPS-ItalicMT-Identity-H  CID TrueType      yes yes yes     58  0
TXIHGB+Cambria-Identity-H                CID TrueType      yes yes yes    100  0
CRJWMD+TimesNewRomanPSMT-Identity-H      CID TrueType      yes yes yes    108  0
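
Since the attachments could not be uploaded, here is a rough way to check extracted text for the symptom in the summary (every English character separated by a space). This is only a sketch using the Java standard library. The class name, the method name and the sample strings are made up for illustration and are not taken from the attached file or from PDFBox. It computes the share of whitespace-separated tokens that consist of a single ASCII letter or digit; a value near 1.0 suggests the text was split character by character.

```java
public class SpacedTextCheck {

    // Hypothetical helper (not part of PDFBox): returns the fraction of
    // whitespace-separated tokens that are exactly one ASCII letter or digit.
    public static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // no tokens at all
        }
        String[] tokens = trimmed.split("\\s+");
        int single = 0;
        for (String token : tokens) {
            if (token.length() == 1) {
                char c = token.charAt(0);
                // Count only ASCII letters and digits, so CJK characters
                // and punctuation are not counted as "spaced-out" letters.
                if (c < 128 && Character.isLetterOrDigit(c)) {
                    single++;
                }
            }
        }
        return (double) single / tokens.length;
    }

    public static void main(String[] args) {
        // Made-up samples: one spaced out character by character, one normal.
        String broken = "I K A n a l y z e r";
        String normal = "IKAnalyzer is a Chinese tokenizer";
        System.out.println("broken: " + singleCharTokenRatio(broken));
        System.out.println("normal: " + singleCharTokenRatio(normal));
    }
}
```

The same method could be applied to the text file written by ExtractText. The threshold for what counts as "broken" is a judgment call and would need tuning on real output.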