Frederic Konrad <fred.kon...@greensocs.com> writes:

> On 10/04/2015 18:03, Frederic Konrad wrote:
>> On 30/03/2015 23:46, Peter Maydell wrote:
>>> On 30 March 2015 at 07:52, Mark Burton <mark.bur...@greensocs.com>
>>> wrote:
>>>> So - Fred is unwilling to send the patch set as it stands, because
>>>> frankly this part is totally broken.
>>>>
>>>> There is an independent patch set that needs splitting out which
>>>> deals with just the atomic instruction issue - specifically for ARM
>>>> (though I guess it's applicable across the board)...
>>>>
>>>> So - in short - I HOPE to get the patch set onto the reflector
>>>> sometime next week, and I'm sorry for the delay.
>>>
>>> What I really want to see is not so much the patch set but the
>>> design sketch I asked for that lists the various data structures
>>> and indicates which ones are going to be per-cpu, which ones will
>>> be shared (and with what locking), etc.
>>>
>>> -- PMM
>
> Does that make sense?
>
> BTW here is the repository:
> git clone g...@git.greensocs.com:fkonrad/mttcg.git -b multi_tcg_v4

Is there a non-authenticated read-only http or git:// access to this repo?

> Thanks,
> Fred
>
>> Hi everybody,
>> Hi Peter,
>>
>> I tried to recap what we did, how it "works", and what the status is:
>>
>> All the mechanisms are basically unchanged.
>>
>> A lot of TCG structures are not thread-safe, yet all TCG threads can
>> run at the same time and sometimes want to generate code at the same
>> time.
>>
>> Translation-block-related structure:
>>
>> struct TBContext {
>>
>>     TranslationBlock *tbs;
>>     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>>     int nb_tbs;
>>     /* any access to the tbs or the page table must use this lock */
>>     QemuMutex tb_lock;
>>
>>     /* statistics */
>>     int tb_flush_count;
>>     int tb_phys_invalidate_count;
>>
>>     int tb_invalidated_flag;
>> };
>>
>> This structure is used in TCGContext: TBContext tb_ctx;
>>
>> "tbs" is basically where the translated blocks are stored, and
>> tb_phys_hash is a hash table to find them quickly.
>>
>> There are two solutions to prevent threading issues:
>> A/ Have a separate tb_ctx for each CPU.
>> B/ Share it between CPUs and protect the tb_ctx accesses.
>>
>> We took the second solution so all CPUs can benefit from the
>> translated TBs. TBContext is written to almost everywhere in
>> translate-all.c. When there are too many TBs, a tb_flush occurs and
>> destroys the array; we don't handle this case right now. tb_lock is
>> already used by user-mode code, so we just converted it to a
>> QemuMutex so we can reuse it in system mode.
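>>
>> To make the locking discipline concrete, here is a rough sketch of
>> the lookup-or-generate path a CPU thread would take. This is
>> illustrative only: mttcg_tb_lookup and mttcg_tb_generate are names
>> invented for this example, not the real functions in cpu-exec.c and
>> translate-all.c; tcg_ctx, QemuMutex and the lock calls are real.
>>
>> static TranslationBlock *mttcg_get_tb(CPUState *cpu, target_ulong pc)
>> {
>>     TBContext *tb_ctx = &tcg_ctx.tb_ctx;
>>     TranslationBlock *tb;
>>
>>     /* Serialize both the tb_phys_hash walk and code generation on
>>        the shared TBContext. */
>>     qemu_mutex_lock(&tb_ctx->tb_lock);
>>     tb = mttcg_tb_lookup(tb_ctx, pc);      /* walk tb_phys_hash */
>>     if (tb == NULL) {
>>         /* Only one CPU generates code at a time, so the same TB is
>>            never translated twice into the shared pool. */
>>         tb = mttcg_tb_generate(cpu, pc);
>>     }
>>     qemu_mutex_unlock(&tb_ctx->tb_lock);
>>     return tb;
>> }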
>>
>> struct TCGContext {
>>     uint8_t *pool_cur, *pool_end;
>>     TCGPool *pool_first, *pool_current, *pool_first_large;
>>     TCGLabel *labels;
>>     int nb_labels;
>>     int nb_globals;
>>     int nb_temps;
>>
>>     /* goto_tb support */
>>     tcg_insn_unit *code_buf;
>>     uintptr_t *tb_next;
>>     uint16_t *tb_next_offset;
>>     uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */
>>
>>     /* liveness analysis */
>>     uint16_t *op_dead_args; /* for each operation, each bit tells if
>>                                the corresponding argument is dead */
>>     uint8_t *op_sync_args;  /* for each operation, each bit tells if
>>                                the corresponding output argument needs
>>                                to be synced to memory. */
>>
>>     /* tells in which temporary a given register is. It does not take
>>        into account fixed registers */
>>     int reg_to_temp[TCG_TARGET_NB_REGS];
>>     TCGRegSet reserved_regs;
>>     intptr_t current_frame_offset;
>>     intptr_t frame_start;
>>     intptr_t frame_end;
>>     int frame_reg;
>>
>>     tcg_insn_unit *code_ptr;
>>     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
>>     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>>
>>     GHashTable *helpers;
>>
>> #ifdef CONFIG_PROFILER
>>     /* profiling info */
>>     int64_t tb_count1;
>>     int64_t tb_count;
>>     int64_t op_count; /* total insn count */
>>     int op_count_max; /* max insn per TB */
>>     int64_t temp_count;
>>     int temp_count_max;
>>     int64_t del_op_count;
>>     int64_t code_in_len;
>>     int64_t code_out_len;
>>     int64_t interm_time;
>>     int64_t code_time;
>>     int64_t la_time;
>>     int64_t opt_time;
>>     int64_t restore_count;
>>     int64_t restore_time;
>> #endif
>>
>> #ifdef CONFIG_DEBUG_TCG
>>     int temps_in_use;
>>     int goto_tb_issue_mask;
>> #endif
>>
>>     uint16_t gen_opc_buf[OPC_BUF_SIZE];
>>     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
>>
>>     uint16_t *gen_opc_ptr;
>>     TCGArg *gen_opparam_ptr;
>>     target_ulong gen_opc_pc[OPC_BUF_SIZE];
>>     uint16_t gen_opc_icount[OPC_BUF_SIZE];
>>     uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
>>
>>     /* Code generation.  Note that we specifically do not use
>>        tcg_insn_unit here, because there's too much arithmetic
>>        throughout that relies on addition and subtraction working
>>        on bytes.  Rely on the GCC extension that allows arithmetic
>>        on void*. */
>>     int code_gen_max_blocks;
>>     void *code_gen_prologue;
>>     void *code_gen_buffer;
>>     size_t code_gen_buffer_size;
>>     /* threshold to flush the translated code buffer */
>>     size_t code_gen_buffer_max_size;
>>     void *code_gen_ptr;
>>
>>     TBContext tb_ctx;
>>
>>     /* The TCGBackendData structure is private to tcg-target.c. */
>>     struct TCGBackendData *be;
>> };
>>
>> This structure is used to translate the TBs. The easiest solution was
>> to protect code generation so that only one CPU can generate code at
>> a time. This is fine since we don't want duplicate TBs in the pool
>> anyway, and it is achieved with the tb_lock used above.
>>
>> TLB:
>>
>> The TLB is per-CPU, so it is not really a problem as in our
>> implementation one CPU = one pthread. But sometimes a CPU wants to
>> flush the TLB, through an instruction for example, while it is very
>> likely that another CPU in another thread is executing code at the
>> same time. That's why we chose to create a tlb_flush mechanism: when
>> a CPU wants to flush, it asks all CPUs to exit TCG, waits for them,
>> and then exits itself. This can be reused for tb_invalidate and/or
>> tb_flush as well, as sketched below.
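>>
>> A rough sketch of that rendezvous (illustrative only: the flag
>> pending_exclusive_work and the helpers wait_for_all_cpus_halted and
>> resume_all_cpus are invented for this example; cpu_exit() and
>> CPU_FOREACH are the real QEMU primitives):
>>
>> static volatile bool pending_exclusive_work; /* invented for sketch */
>>
>> static void mttcg_run_exclusive(void (*func)(void *), void *data)
>> {
>>     CPUState *cpu;
>>
>>     /* Ask every vCPU thread to leave the TCG execution loop; the
>>        flag would be checked in the cpu_exec() loop. */
>>     pending_exclusive_work = true;
>>     CPU_FOREACH(cpu) {
>>         cpu_exit(cpu);
>>     }
>>     wait_for_all_cpus_halted();
>>
>>     /* No CPU is executing generated code now, so it is safe to do
>>        the tlb_flush (or, later, tb_flush/tb_invalidate) here. */
>>     func(data);
>>
>>     pending_exclusive_work = false;
>>     resume_all_cpus();
>> }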
>>
>> Atomic instructions:
>>
>> Atomic instructions are quite hard to implement. The
>> TranslationBlock implementing the atomic instruction can't be
>> interrupted during execution (e.g. by an interrupt or a signal); a
>> cmpxchg64 helper is used for that.
>>
>> QEMU's global lock:
>>
>> The TCG thread takes the lock during code execution. This is not OK
>> for multi-threading because it means only one thread will be running
>> at a time. That's why we took Jan's patch to allow TCG to run
>> without the lock and take it only when needed.
>>
>> What is the status:
>>
>> * We can start a vexpress-a15 simulation with two A15s and run two
>>   Dhrystones at a time; performance is improved and it's quite
>>   stable.
>>
>> What is missing:
>>
>> * tb_flush is not implemented correctly.
>> * The PageDesc structure is not protected: the patch which
>>   introduced a first_tb array was not the right approach and has
>>   been removed. This implies that tb_invalidate is broken.
>>
>> For both issues we plan to use the same mechanism as tlb_flush:
>> exit all the CPUs, flush or invalidate, and let them continue. A
>> generic mechanism must be implemented for that.
>>
>> Known issues:
>>
>> * The GDB stub is broken because it uses tb_invalidate, which we
>>   haven't implemented yet, and there are probably other issues.
>> * SMP > 2 crashes, probably because of tb_invalidate as well.
>> * We don't know the status of the user-mode code, which is probably
>>   broken by our changes.

-- Alex Bennée