[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530448#comment-17530448
 ] 

Dan Coldrick commented on TIKA-3742:
------------------------------------

Hi [~nick] 

I'm struggling. I can see there are deleted elements that I want to exclude 
from the parser output, but I can't work out how to detect them in Java.

I can see them come out in the DGN dump with a deleted attribute:

 
{code:java}
Element:Text         Level:27 id:19707  (DELETED) 
  offset=1959730  size=74 bytes
  graphic_group:0   color:0 weight:0 style:0
  properties=1536,MODIFIED,NEW
  origin=(963453.83000,96730.11000), rotation=272.763292
  font=1, just=2, length_mult=119.99, height_mult=119.99
  string = "HARVARD     RD" {code}
 From the core element structure, I can see the flag should be in the header:

 

 
{code:java}
The first 18 words of an element in the design file are its fixed header -- 
 containing the element type, level, words to follow, and range 
 information. The C declaration for this header is as follows
 
   typedef struct
      {
      unsigned          level:6    ;   /* level element is on */
      unsigned          :1         ;   /* reserved */
      unsigned          complex:1  ;   /* component of complex elem. */
      unsigned          type:7     ;   /* type of element */
      unsigned          deleted:1  ;   /* set if element is deleted */
      unsigned short    words      ;   /* words to follow in element */
      unsigned long     xlow       ;   /* element range - low */
      unsigned long     ylow       ;
      unsigned long     zlow       ;
      unsigned long     xhigh      ;   /* element range - high */
      unsigned long     yhigh      ;
      unsigned long     zhigh      ;
      } Elm_hdr         {code}
 

 

You get the type out like this (which I think comes from the same header structure):
{code:java}
int h2 = tstream.read() ;
int type = h2 & 0x7f; {code}
How do I get the deleted attribute out so I can remove those elements from 
the parsed content? Also, you mentioned type 37; I don't have any examples 
containing type 37 elements.
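
For reference, here is a rough sketch of how I'd guess the decode might look, going by the struct above. It assumes the bitfields are packed low bit first, so the first header byte holds level (6 bits), the reserved bit and the complex bit, and the second byte holds type (7 bits) with deleted as the top bit. That layout is only my assumption from the struct and from the existing {{h2 & 0x7f}} mask; {{DgnHeaderBits}} is a made-up helper name, not something in Tika or DGNLib.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: decodes the first two bytes of a DGN element header. */
final class DgnHeaderBits {
    final int level;        // low 6 bits of the first byte
    final boolean complex;  // top bit of the first byte
    final int type;         // low 7 bits of the second byte
    final boolean deleted;  // top bit of the second byte (assumed layout)

    DgnHeaderBits(int h1, int h2) {
        this.level = h1 & 0x3f;
        this.complex = (h1 & 0x80) != 0;
        this.type = h2 & 0x7f;
        this.deleted = (h2 & 0x80) != 0;
    }

    /** Reads the two leading header bytes from the stream and decodes them. */
    static DgnHeaderBits read(InputStream in) throws IOException {
        int h1 = in.read();
        int h2 = in.read();
        if (h1 < 0 || h2 < 0) {
            throw new EOFException("Stream ended inside an element header");
        }
        return new DgnHeaderBits(h1, h2);
    }
}
```

If that layout is right, the parser could check {{deleted}} straight after reading the header and skip the element's text instead of emitting it. The dump above (Text, level 27, DELETED) would be consistent with header bytes 0x1B and 0x91 under this assumption, but I haven't confirmed that against the raw file.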

 

I've created a fork with some rough test code in:

[https://github.com/monkmachine/tika/tree/TIKA-3742/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dgn]

 

 

> Advice around DGN7 parser and whether to add to TIKA
> ----------------------------------------------------
>
>                 Key: TIKA-3742
>                 URL: https://issues.apache.org/jira/browse/TIKA-3742
>             Project: Tika
>          Issue Type: Task
>          Components: parser
>            Reporter: Dan Coldrick
>            Priority: Minor
>         Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/]  for 
> DGN7 which produces a dgndump.exe which will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this or just keep it as a custom 
> parser rather than in the main source code? It's under MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing calling the exe (my code I was only interested in the Strings 
> so am only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
